请 [注册] 或 [登录]  | 返回主站

量化交易吧 /  数理科学 帖子:3365785 新帖:17

重温pandas

SCSDV_d发表于:5 月 9 日 21:53回复(1)

这几天把pandas重温了一下,结合聚宽数据,感觉更好理解聚宽的用法。

import pandas as pd import numpy as npimport matplotlib.pyplot as plt

一直没有系统的学习过pandas,正好看到一篇文章,结合聚宽再捋一遍。


1.   Series¶

1.1   创建一个 Series 的基本语法如下:¶

my_series = pd.Series(data,index)¶

上面的 data 参数可以是任意数据对象,比如字典、列表甚至是 NumPy 数组,而index 参数则是对 data 的索引值,类似字典的 key。¶

1.1.1创建一个Series,data是数字,索引是字符串¶

guojia = ['usa','ru','cn','jp']qty = [100,300,500,200]
ser = pd.Series(qty,guojia)
ser
usa    100
ru     300
cn     500
jp     200
dtype: int64
ser2 = pd.Series(guojia,qty)
ser2
100    usa
300     ru
500     cn
200     jp
dtype: object

以上,括号前面的是Series的values,后面的是index¶

注意:请记住, index 参数是可省略的,你可以选择不输入这个参数。如果不带 index 参数,Pandas 会自动用默认 index 进行索引,类似数组,索引值是 [0, ..., len(data) - 1]¶

如果你从一个 Python 字典对象创建 Series,Pandas 会自动把字典的键值设置成 Series 的 index,并将对应的 values 放在和索引对应的 data 里。¶

和 NumPy 数组不同,Pandas 的 Series 能存放各种不同类型的对象。¶

1.1.2从 Series 里获取数据¶

访问 Series 里的数据的方式,和 Python 字典基本一样:¶

ser
usa    100
ru     300
cn     500
jp     200
dtype: int64
ser['ru']
300
ser['jp']
200

1.1.3 对 Series 进行算术运算操作¶

对 Series 的算术运算都是基于 index 进行的。我们可以用加减乘除(+ - * /)这样的运算符对两个 Series 进行运算,Pandas 将会根据索引 index,对响应的数据进行计算,结果将会以浮点数的形式存储,以避免丢失精度。¶

ser = pd.Series(qty,guojia)
guojia2 =['au','cn','jp','usa']qty2 =[200,300,400,598]
ser3 = pd.Series(qty2,guojia2)
ser3
au     200
cn     300
jp     400
usa    598
dtype: int64
ser - ser3
au       NaN
cn     200.0
jp    -200.0
ru       NaN
usa   -498.0
dtype: float64
ser + ser3
au       NaN
cn     800.0
jp     600.0
ru       NaN
usa    698.0
dtype: float64

Series基于索引进行运算,我们不必理会对齐的问题,系统自动将索引相同的值进行运算;如果一个series缺少某个索引,其运算结果为 NaN¶

2. DataFrames¶

Pandas 的 DataFrame(数据表)是一种 2 维数据结构,数据以表格的形式存储,分成若干行和列。通过 DataFrame,你能很方便地处理数据。常见的操作比如选取、替换行或列的数据,还能重组数据表、修改索引、多重筛选等。¶

2.1构建一个 DataFrame 对象的基本语法如下:¶

df = get_price('RB9999.XSGE',start_date='2018-12-1', end_date='2019-3-1',fields=None)
df.head()

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


openclosehighlowvolumemoney
2018-12-033250.03333.03495.03208.05092070.01.696123e+11
2018-12-043306.03374.03388.03283.04807696.01.600262e+11
2018-12-053389.03463.03466.03362.05400890.01.840964e+11
2018-12-063470.03375.03473.03371.04500922.01.535990e+11
2018-12-073360.03397.03426.03334.05127774.01.729243e+11
df['open'].head()
2018-12-03    3250.0
2018-12-04    3306.0
2018-12-05    3389.0
2018-12-06    3470.0
2018-12-07    3360.0
Name: open, dtype: float64
df['volume'].tail()
2019-02-25    3659076.0
2019-02-26    3998202.0
2019-02-27    2948082.0
2019-02-28    3461158.0
2019-03-01    2986416.0
Name: volume, dtype: float64

看,上面表中的每一列基本上就是一个 Series ,它们都用了同一个 index。因此,我们基本上可以把 DataFrame 理解成一组采用同样索引的 Series 的集合。¶

2.2 获取 DataFrame 中的列 <  color = red  >¶

要获取一列的数据,还是用中括号 [] 的方式,跟 Series 类似。比如尝试获取上面这个表中的 open 列数据:¶

df['open'].head()
2018-12-03    3250.0
2018-12-04    3306.0
2018-12-05    3389.0
2018-12-06    3470.0
2018-12-07    3360.0
Name: open, dtype: float64

2.2.1 因为我们只获取一列,所以返回的就是一个 Series。可以用 type() 函数确认返回值的类型¶

type(df['open'])
pandas.core.series.Series

2.2.2 如果获取多个列,那返回的就是一个 DataFrame 类型:¶

df2 =df[['open','volume']].head()
df2

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


openvolume
2018-12-033250.05092070.0
2018-12-043306.04807696.0
2018-12-053389.05400890.0
2018-12-063470.04500922.0
2018-12-073360.05127774.0
type(df2)
pandas.core.frame.DataFrame
 

2.3 向 DataFrame 里增加数据列¶

创建一个列的时候,你需要先定义这个列的数据和索引。¶

增加数据列有两种办法:可以从头开始定义一个 pd.Series,再把它放到表中,也可以利用现有的列来产生需要的新列。比如下面两种操作:¶

2.3.1  定义一个 Series ,并放入 'open-close' 列中: 新series的值是“NaN"¶

df['o-c'] =pd.Series()
df.head()

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


openclosehighlowvolumemoneyo-c
2018-12-033250.03333.03495.03208.05092070.01.696123e+11NaN
2018-12-043306.03374.03388.03283.04807696.01.600262e+11NaN
2018-12-053389.03463.03466.03362.05400890.01.840964e+11NaN
2018-12-063470.03375.03473.03371.04500922.01.535990e+11NaN
2018-12-073360.03397.03426.03334.05127774.01.729243e+11NaN

2.3.3 从现有的列创建新列¶

直接进行运算¶

df['o-c2'] = df['open'] - df['close']
df.head()

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


openclosehighlowvolumemoneyo-co-c2
2018-12-033250.03333.03495.03208.05092070.01.696123e+11NaN-83.0
2018-12-043306.03374.03388.03283.04807696.01.600262e+11NaN-68.0
2018-12-053389.03463.03466.03362.05400890.01.840964e+11NaN-74.0
2018-12-063470.03375.03473.03371.04500922.01.535990e+11NaN95.0
2018-12-073360.03397.03426.03334.05127774.01.729243e+11NaN-37.0
 
 
def cmp(df1,df2):cmp = df1.values - df2.values#     if cmp > 0:#         cmp1 = True#     else:#         cmp1 = Falsereturn cmp
df['cmp'] = cmp(df['open'],df['close'])
df.head()

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


openclosehighlowvolumemoneycmp
2018-12-033250.03333.03495.03208.05092070.01.696123e+11-83.0
2018-12-043306.03374.03388.03283.04807696.01.600262e+11-68.0
2018-12-053389.03463.03466.03362.05400890.01.840964e+11-74.0
2018-12-063470.03375.03473.03371.04500922.01.535990e+1195.0
2018-12-073360.03397.03426.03334.05127774.01.729243e+11-37.0

2.4  从 DataFrame 里删除行/列¶

2.4.1 想要删除某一行或一列,可以用 .drop() 函数。在使用这个函数的时候,你需要先指定具体的删除方向,axis=0 对应的是行 row,而 axis=1 对应的是列 column 。¶

df.head()

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


openclosehighlowvolumemoneyo-co-c2
2018-12-033250.03333.03495.03208.05092070.01.696123e+11NaN-83.0
2018-12-043306.03374.03388.03283.04807696.01.600262e+11NaN-68.0
2018-12-053389.03463.03466.03362.05400890.01.840964e+11NaN-74.0
2018-12-063470.03375.03473.03371.04500922.01.535990e+11NaN95.0
2018-12-073360.03397.03426.03334.05127774.01.729243e+11NaN-37.0

要删除‘o-c’这一列,用axis=1,代表column¶

df.head().drop('o-c',axis=1)

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


openclosehighlowvolumemoneyo-c2
2018-12-033250.03333.03495.03208.05092070.01.696123e+11-83.0
2018-12-043306.03374.03388.03283.04807696.01.600262e+11-68.0
2018-12-053389.03463.03466.03362.05400890.01.840964e+11-74.0
2018-12-063470.03375.03473.03371.04500922.01.535990e+1195.0
2018-12-073360.03397.03426.03334.05127774.01.729243e+11-37.0

请务必记住,除非用户明确指定,否则在调用 .drop() 的时候,Pandas 并不会真的永久性地删除这行/列。这主要是为了防止用户误操作丢失数据。

你可以通过调用 df 来确认数据的完整性。如果你确定要永久性删除某一行/列,你需要加上 inplace=True 参数,比如:

重新打印df

df.head()

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


openclosehighlowvolumemoneyo-co-c2
2018-12-033250.03333.03495.03208.05092070.01.696123e+11NaN-83.0
2018-12-043306.03374.03388.03283.04807696.01.600262e+11NaN-68.0
2018-12-053389.03463.03466.03362.05400890.01.840964e+11NaN-74.0
2018-12-063470.03375.03473.03371.04500922.01.535990e+11NaN95.0
2018-12-073360.03397.03426.03334.05127774.01.729243e+11NaN-37.0

永久删除 inplace =True  .....(报错了,以后再学)

df.head().drop('o-c',axis=1,inplace=True )
/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py:3697: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the c*eats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)

2.5  获取 DataFrame 中的一行或多行数据¶

要获取某一行,你需要用 .loc[] 来按索引(标签名)引用这一行,或者用 .iloc[],按这行在表中的位置(行数)来引用。¶

df.head()

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


openclosehighlowvolumemoneyo-co-c2
2018-12-033250.03333.03495.03208.05092070.01.696123e+11NaN-83.0
2018-12-043306.03374.03388.03283.04807696.01.600262e+11NaN-68.0
2018-12-053389.03463.03466.03362.05400890.01.840964e+11NaN-74.0
2018-12-063470.03375.03473.03371.04500922.01.535990e+11NaN95.0
2018-12-073360.03397.03426.03334.05127774.01.729243e+11NaN-37.0
df.head().loc['2018-12-05']
open      3.389000e+03
close     3.463000e+03
high      3.466000e+03
low       3.362000e+03
volume    5.400890e+06
money     1.840964e+11
o-c                NaN
o-c2     -7.400000e+01
Name: 2018-12-05 00:00:00, dtype: float64
type(df.head().loc['2018-12-07'])
pandas.core.series.Series

输入:  df.head(10).iloc[[1,9]] 试一试¶

用 .loc[] 来按索引(标签名)引用这一行¶

df.head(10).iloc[[1,9]]

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


openclosehighlowvolumemoneyo-co-c2
2018-12-043306.03374.03388.03283.04807696.01.600262e+11NaN-68.0
2018-12-143416.03445.03450.03405.04124206.01.413454e+11NaN-29.0

同时你可以用 .loc[] 来指定具体的行列范围,并生成一个子数据表,就像在 NumPy里做的一样。比如,提取 '12-08' 行中 'money’,'open'等1列或多 列的内容,可以如下操作¶

df.head().loc['2018-12-07',['money']]
money    1.729243e+11
Name: 2018-12-07 00:00:00, dtype: float64
 
df.head().loc['2018-12-07',['money','high','open']]
money    1.729243e+11
high     3.426000e+03
open     3.360000e+03
Name: 2018-12-07 00:00:00, dtype: float64
type(df.head().loc['2018-12-05',['open','high','volume']])
pandas.core.series.Series

iloc 直接写索引内容,报错¶

因为    : 用 .iloc[],按这行在表中的位置(行数)来引用。¶

df.head(10).iloc[['2018-12-04','2018-12-14']]
-ValueError                                Traceback (most recent call last)<ipython-input-73-74b1efb5eb04> in <module>> 1 df.head(10).iloc[['2018-12-04','2018-12-14']]/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py in __getitem__(self, key)   1476    1477             maybe_callable = com._apply_if_callable(key, self.obj)-> 1478             return self._getitem_axis(maybe_callable, axis=axis)   1479    1480     def _is_scalar_access(self, key):/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)   2089         # a list of integers   2090         elif is_list_like_indexer(key):-> 2091             return self._get_list_axis(key, axis=axis)   2092    2093         # a single integer/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py in _get_list_axis(self, key, axis)   2068             axis = self.axis or 0   2069         try:-> 2070             return self.obj._take(key, axis=axis)   2071         except IndexError:   2072             # re-raise with different error message/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in _take(self, indices, axis, is_copy)   2787         new_data = self._data.take(indices,   2788                                    axis=self._get_block_manager_axis(axis),-> 2789                                    verify=True)   2790         result = self._constructor(new_data).__finalize__(self)   2791 /opt/conda/lib/python3.6/site-packages/pandas/core/internals.py in take(self, indexer, axis, verify, convert)   4524                              dtype='int64')   4525                    if isinstance(indexer, slice)-> 4526                    else np.asanyarray(indexer, dtype='int64'))   4527    4528         n = self.shape[axis]/opt/conda/lib/python3.6/site-packages/numpy/core/numeric.py in asanyarray(a, dtype, order)    542     543     """> 544     return array(a, dtype, copy=False, order=order, subok=True)    545     546 ValueError: invalid literal for int() with base 10: '2018-12-04'
 

2.5  条件筛选¶

2.51 用中括号 [] 的方式,除了直接指定选中某些列外,还能接收一个条件语句,然后筛选出符合条件的行/列。比如,我们希望在下面这个表格中筛选出 'o-c2'>0 的行:¶

df.head()

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


openclosehighlowvolumemoneyo-co-c2
2018-12-033250.03333.03495.03208.05092070.01.696123e+11NaN-83.0
2018-12-043306.03374.03388.03283.04807696.01.600262e+11NaN-68.0
2018-12-053389.03463.03466.03362.05400890.01.840964e+11NaN-74.0
2018-12-063470.03375.03473.03371.04500922.01.535990e+11NaN95.0
2018-12-073360.03397.03426.03334.05127774.01.729243e+11NaN-37.0
df['o-c2']>0
2018-12-03    False
2018-12-04    False
2018-12-05    False
2018-12-06     True
2018-12-07    False
2018-12-10     True
2018-12-11    False
2018-12-12     True
2018-12-13    False
2018-12-14    False
2018-12-17     True
2018-12-18    False
2018-12-19     True
2018-12-20    False
2018-12-21    False
2018-12-24     True
2018-12-25     True
2018-12-26    False
2018-12-27     True
2018-12-28    False
2019-01-02     True
2019-01-03    False
2019-01-04    False
2019-01-07    False
2019-01-08     True
2019-01-09    False
2019-01-10    False
2019-01-11    False
2019-01-14    False
2019-01-15     True
2019-01-16    False
2019-01-17    False
2019-01-18    False
2019-01-21    False
2019-01-22     True
2019-01-23    False
2019-01-24    False
2019-01-25    False
2019-01-28     True
2019-01-29     True
2019-01-30     True
2019-01-31    False
2019-02-01    False
2019-02-11     True
2019-02-12     True
2019-02-13     True
2019-02-14     True
2019-02-15     True
2019-02-18    False
2019-02-19     True
2019-02-20     True
2019-02-21    False
2019-02-22    False
2019-02-25     True
2019-02-26    False
2019-02-27     True
2019-02-28    False
2019-03-01    False
Name: o-c2, dtype: bool
df[df['o-c2']>0][['open','close','money']]

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


openclosemoney
2018-12-063470.03375.01.535990e+11
2018-12-103388.03312.01.584575e+11
2018-12-123340.03317.01.404878e+11
2018-12-173448.03435.01.345986e+11
2018-12-193432.03425.01.328182e+11
2018-12-243515.03451.01.227598e+11
2018-12-253442.03398.01.473822e+11
2018-12-273418.03396.01.428099e+11
2019-01-023398.03382.07.286511e+10
2019-01-083521.03505.09.602977e+10
2019-01-153576.03519.01.105785e+11
2019-01-223650.03633.01.455658e+11
2019-01-283712.03681.01.238699e+11
2019-01-293680.03675.09.212779e+10
2019-01-303683.03677.01.671236e+11
2019-02-113850.03825.01.342309e+11
2019-02-123818.03785.01.180507e+11
2019-02-133780.03702.01.425991e+11
2019-02-143711.03684.01.205294e+11
2019-02-153680.03599.01.423635e+11
2019-02-193670.03655.01.013994e+11
2019-02-203650.03641.01.418150e+11
2019-02-253736.03682.01.363059e+11
2019-02-273737.03715.01.098979e+11

你可以用逻辑运算符 &(与)和 |(或)来链接多个条件语句,以便一次应用多个筛选条件到当前的 DataFrame 上。举个栗子,你可以用下面的方法筛选出同时满足 'o-c2'>0 和'close'> 3400 的行:比如 ‘2018-12-06’这行

df.head(20)

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


openclosehighlowvolumemoneyo-co-c2
2018-12-033250.03333.03495.03208.05092070.01.696123e+11NaN-83.0
2018-12-043306.03374.03388.03283.04807696.01.600262e+11NaN-68.0
2018-12-053389.03463.03466.03362.05400890.01.840964e+11NaN-74.0
2018-12-063470.03375.03473.03371.04500922.01.535990e+11NaN95.0
2018-12-073360.03397.03426.03334.05127774.01.729243e+11NaN-37.0
2018-12-103388.03312.03398.03303.04741366.01.584575e+11NaN76.0
2018-12-113306.03327.03341.03284.03595984.01.189109e+11NaN-21.0
2018-12-123340.03317.03379.03317.04194776.01.404878e+11NaN23.0
2018-12-133325.03419.03431.03321.05327258.01.805737e+11NaN-94.0
2018-12-143416.03445.03450.03405.04124206.01.413454e+11NaN-29.0
2018-12-173448.03435.03470.03414.03915682.01.345986e+11NaN13.0
2018-12-183430.03435.03443.03391.03671214.01.255123e+11NaN-5.0
2018-12-193432.03425.03450.03411.03871012.01.328182e+11NaN7.0
2018-12-203433.03481.03492.03427.04306708.01.488089e+11NaN-48.0
2018-12-213500.03508.03535.03471.04313258.01.511073e+11NaN-8.0
2018-12-243515.03451.03516.03435.03542690.01.227598e+11NaN64.0
2018-12-253442.03398.03448.03357.04333400.01.473822e+11NaN44.0
2018-12-263393.03409.03430.03387.03285056.01.119878e+11NaN-16.0
2018-12-273418.03396.03471.03395.04172568.01.428099e+11NaN22.0
2018-12-283387.03404.03420.03382.02915936.09.916103e+10NaN-17.0
 

df[ (df['o-c2']>0) & (df['close'] > 3400)  ]¶

df[  (  ) & ()   ]    () = df['column name']  > 0¶

验证一下()的类型¶

df[ (df['o-c2']>0) & (df['close'] > 3400)  ]

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


openclosehighlowvolumemoneyo-co-c2
2018-12-173448.03435.03470.03414.03915682.01.345986e+11NaN13.0
2018-12-193432.03425.03450.03411.03871012.01.328182e+11NaN7.0
2018-12-243515.03451.03516.03435.03542690.01.227598e+11NaN64.0
2019-01-083521.03505.03523.03483.02740906.09.602977e+10NaN16.0
2019-01-153576.03519.03576.03508.03128980.01.105785e+11NaN57.0
2019-01-223650.03633.03700.03630.03970252.01.455658e+11NaN17.0
2019-01-283712.03681.03757.03680.03337710.01.238699e+11NaN31.0
2019-01-293680.03675.03694.03653.02508228.09.212779e+10NaN5.0
2019-01-303683.03677.03767.03667.04490378.01.671236e+11NaN6.0
2019-02-113850.03825.03908.03816.03480360.01.342309e+11NaN25.0
2019-02-123818.03785.03830.03771.03109428.01.180507e+11NaN33.0
2019-02-133780.03702.03796.03692.03814494.01.425991e+11NaN78.0
2019-02-143711.03684.03717.03668.03266348.01.205294e+11NaN27.0
2019-02-153680.03599.03696.03593.03916316.01.423635e+11NaN81.0
2019-02-193670.03655.03698.03647.02760590.01.013994e+11NaN15.0
2019-02-203650.03641.03669.03578.03918544.01.418150e+11NaN9.0
2019-02-253736.03682.03766.03681.03659076.01.363059e+11NaN54.0
2019-02-273737.03715.03750.03697.02948082.01.098979e+11NaN22.0
xx = (df['o-c2']>0)
type(xx)
pandas.core.series.Series
yy = df['o-c2'] >0
type(yy)
pandas.core.series.Series
df[ (df['o-c2'] >0)  & (df['close'] > 3400)   ]

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


openclosehighlowvolumemoneyo-co-c2
2018-12-173448.03435.03470.03414.03915682.01.345986e+11NaN13.0
2018-12-193432.03425.03450.03411.03871012.01.328182e+11NaN7.0
2018-12-243515.03451.03516.03435.03542690.01.227598e+11NaN64.0
2019-01-083521.03505.03523.03483.02740906.09.602977e+10NaN16.0
2019-01-153576.03519.03576.03508.03128980.01.105785e+11NaN57.0
2019-01-223650.03633.03700.03630.03970252.01.455658e+11NaN17.0
2019-01-283712.03681.03757.03680.03337710.01.238699e+11NaN31.0
2019-01-293680.03675.03694.03653.02508228.09.212779e+10NaN5.0
2019-01-303683.03677.03767.03667.04490378.01.671236e+11NaN6.0
2019-02-113850.03825.03908.03816.03480360.01.342309e+11NaN25.0
2019-02-123818.03785.03830.03771.03109428.01.180507e+11NaN33.0
2019-02-133780.03702.03796.03692.03814494.01.425991e+11NaN78.0
2019-02-143711.03684.03717.03668.03266348.01.205294e+11NaN27.0
2019-02-153680.03599.03696.03593.03916316.01.423635e+11NaN81.0
2019-02-193670.03655.03698.03647.02760590.01.013994e+11NaN15.0
2019-02-203650.03641.03669.03578.03918544.01.418150e+11NaN9.0
2019-02-253736.03682.03766.03681.03659076.01.363059e+11NaN54.0
2019-02-273737.03715.03750.03697.02948082.01.098979e+11NaN22.0

2.6    重置 DataFrame 的索引¶

如果你觉得当前 DataFrame 的索引有问题,你可以用 .reset_index() 简单地把整个表的索引都重置掉。这个方法将把目标 DataFrame 的索引保存在一个叫 index 的列中,而把表格的索引变成默认的从零开始的数字,也就是 [0, ..., len(data) - 1] 。比如下面这样:¶

df.head()

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


openclosehighlowvolumemoneyo-co-c2
2018-12-033250.03333.03495.03208.05092070.01.696123e+11NaN-83.0
2018-12-043306.03374.03388.03283.04807696.01.600262e+11NaN-68.0
2018-12-053389.03463.03466.03362.05400890.01.840964e+11NaN-74.0
2018-12-063470.03375.03473.03371.04500922.01.535990e+11NaN95.0
2018-12-073360.03397.03426.03334.05127774.01.729243e+11NaN-37.0
df.reset_index().head()

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


indexopenclosehighlowvolumemoneyo-co-c2
02018-12-033250.03333.03495.03208.05092070.01.696123e+11NaN-83.0
12018-12-043306.03374.03388.03283.04807696.01.600262e+11NaN-68.0
22018-12-053389.03463.03466.03362.05400890.01.840964e+11NaN-74.0
32018-12-063470.03375.03473.03371.04500922.01.535990e+11NaN95.0
42018-12-073360.03397.03426.03334.05127774.01.729243e+11NaN-37.0

和删除操作差不多,.reset_index() 并不会永久改变你表格的索引,除非你调用的时候明确传入了 inplace 参数,比如:.reset_index(inplace=True)¶

df.head()

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


openclosehighlowvolumemoneyo-co-c2
2018-12-033250.03333.03495.03208.05092070.01.696123e+11NaN-83.0
2018-12-043306.03374.03388.03283.04807696.01.600262e+11NaN-68.0
2018-12-053389.03463.03466.03362.05400890.01.840964e+11NaN-74.0
2018-12-063470.03375.03473.03371.04500922.01.535990e+11NaN95.0
2018-12-073360.03397.03426.03334.05127774.01.729243e+11NaN-37.0
df.reset_index(inplace=True)
df.head()

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


indexopenclosehighlowvolumemoneyo-co-c2
02018-12-033250.03333.03495.03208.05092070.01.696123e+11NaN-83.0
12018-12-043306.03374.03388.03283.04807696.01.600262e+11NaN-68.0
22018-12-053389.03463.03466.03362.05400890.01.840964e+11NaN-74.0
32018-12-063470.03375.03473.03371.04500922.01.535990e+11NaN95.0
42018-12-073360.03397.03426.03334.05127774.01.729243e+11NaN-37.0
 

2.7   设置 DataFrame 的索引值¶

类似地,我们还可以用 .set_index() 方法,将 DataFrame 里的某一列作为索引来用。比如,我们在这个表里新建一个名为 "ID" 的列¶

df.head()

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


indexopenclosehighlowvolumemoneyo-co-c2
02018-12-033250.03333.03495.03208.05092070.01.696123e+11NaN-83.0
12018-12-043306.03374.03388.03283.04807696.01.600262e+11NaN-68.0
22018-12-053389.03463.03466.03362.05400890.01.840964e+11NaN-74.0
32018-12-063470.03375.03473.03371.04500922.01.535990e+11NaN95.0
42018-12-073360.03397.03426.03334.05127774.01.729243e+11NaN-37.0
df['code'] ='rb9999'
df.head()

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


indexopenclosehighlowvolumemoneyo-co-c2code
02018-12-033250.03333.03495.03208.05092070.01.696123e+11NaN-83.0rb9999
12018-12-043306.03374.03388.03283.04807696.01.600262e+11NaN-68.0rb9999
22018-12-053389.03463.03466.03362.05400890.01.840964e+11NaN-74.0rb9999
32018-12-063470.03375.03473.03371.04500922.01.535990e+11NaN95.0rb9999
42018-12-073360.03397.03426.03334.05127774.01.729243e+11NaN-37.0rb9999
df.set_index('code').head()

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


indexopenclosehighlowvolumemoneyo-co-c2
code








rb99992018-12-033250.03333.03495.03208.05092070.01.696123e+11NaN-83.0
rb99992018-12-043306.03374.03388.03283.04807696.01.600262e+11NaN-68.0
rb99992018-12-053389.03463.03466.03362.05400890.01.840964e+11NaN-74.0
rb99992018-12-063470.03375.03473.03371.04500922.01.535990e+11NaN95.0
rb99992018-12-073360.03397.03426.03334.05127774.01.729243e+11NaN-37.0
df.head()

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


indexopenclosehighlowvolumemoneyo-co-c2code
02018-12-033250.03333.03495.03208.05092070.01.696123e+11NaN-83.0rb9999
12018-12-043306.03374.03388.03283.04807696.01.600262e+11NaN-68.0rb9999
22018-12-053389.03463.03466.03362.05400890.01.840964e+11NaN-74.0rb9999
32018-12-063470.03375.03473.03371.04500922.01.535990e+11NaN95.0rb9999
42018-12-073360.03397.03426.03334.05127774.01.729243e+11NaN-37.0rb9999

注意,不像 .reset_index() 会保留一个备份,然后才用默认的索引值代替原索引,.set_index() 将会完全覆盖原来的索引值。

 

2.8   多级索引(MultiIndex)以及命名索引的不同等级¶

多级索引其实就是一个由元组(Tuple)组成的数组,每一个元组都是独一无二的。你可以从一个包含许多数组的列表中创建多级索引(调用 MultiIndex.from_arrays ),也可以用一个包含许多元组的数组(调用 MultiIndex.from_tuples )或者是用一对可迭代对象的集合(比如两个列表,互相两两配对)来构建(调用MultiIndex.from_product )。¶

下面这个例子,我们从元组中创建多级索引:¶

应该暂时用不到多级索引,以后学习。

df.head()

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


indexopenclosehighlowvolumemoneyo-co-c2code
02018-12-033250.03333.03495.03208.05092070.01.696123e+11NaN-83.0rb9999
12018-12-043306.03374.03388.03283.04807696.01.600262e+11NaN-68.0rb9999
22018-12-053389.03463.03466.03362.05400890.01.840964e+11NaN-74.0rb9999
32018-12-063470.03375.03473.03371.04500922.01.535990e+11NaN95.0rb9999
42018-12-073360.03397.03426.03334.05127774.01.729243e+11NaN-37.0rb9999

我们可以用 .index.names 给它们加上名字????

 

2.9    清洗数据¶

删除或填充空值¶

在许多情况下,如果你用 Pandas 来读取大量数据,往往会发现原始数据中会存在不完整的地方。在 DataFrame 中缺少数据的位置, Pandas 会自动填入一个空值,比如 NaN或 Null 。因此,我们可以选择用 .dropna() 来丢弃这些自动填充的值,或是用.fillna() 来自动给这些空值填充数据。¶

比如这个例子:¶

df.head()

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


indexopenclosehighlowvolumemoneyo-co-c2code
02018-12-033250.03333.03495.03208.05092070.01.696123e+11NaN-83.0rb9999
12018-12-043306.03374.03388.03283.04807696.01.600262e+11NaN-68.0rb9999
22018-12-053389.03463.03466.03362.05400890.01.840964e+11NaN-74.0rb9999
32018-12-063470.03375.03473.03371.04500922.01.535990e+11NaN95.0rb9999
42018-12-073360.03397.03426.03334.05127774.01.729243e+11NaN-37.0rb9999
df.dropna()

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


indexopenclosehighlowvolumemoneyo-co-c2code

请注意,如果你没有指定 axis 参数,默认是删除行。¶

axis =1 , 删除含有NaN,Null  的列¶

df.dropna(axis =  1).head()

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


indexopenclosehighlowvolumemoneyo-c2code
02018-12-033250.03333.03495.03208.05092070.01.696123e+11-83.0rb9999
12018-12-043306.03374.03388.03283.04807696.01.600262e+11-68.0rb9999
22018-12-053389.03463.03466.03362.05400890.01.840964e+11-74.0rb9999
32018-12-063470.03375.03473.03371.04500922.01.535990e+1195.0rb9999
42018-12-073360.03397.03426.03334.05127774.01.729243e+11-37.0rb9999
df

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


indexopenclosehighlowvolumemoneyo-co-c2code
02018-12-033250.03333.03495.03208.05092070.01.696123e+11NaN-83.0rb9999
12018-12-043306.03374.03388.03283.04807696.01.600262e+11NaN-68.0rb9999
22018-12-053389.03463.03466.03362.05400890.01.840964e+11NaN-74.0rb9999
32018-12-063470.03375.03473.03371.04500922.01.535990e+11NaN95.0rb9999
42018-12-073360.03397.03426.03334.05127774.01.729243e+11NaN-37.0rb9999
52018-12-103388.03312.03398.03303.04741366.01.584575e+11NaN76.0rb9999
62018-12-113306.03327.03341.03284.03595984.01.189109e+11NaN-21.0rb9999
72018-12-123340.03317.03379.03317.04194776.01.404878e+11NaN23.0rb9999
82018-12-133325.03419.03431.03321.05327258.01.805737e+11NaN-94.0rb9999
92018-12-143416.03445.03450.03405.04124206.01.413454e+11NaN-29.0rb9999
102018-12-173448.03435.03470.03414.03915682.01.345986e+11NaN13.0rb9999
112018-12-183430.03435.03443.03391.03671214.01.255123e+11NaN-5.0rb9999
122018-12-193432.03425.03450.03411.03871012.01.328182e+11NaN7.0rb9999
132018-12-203433.03481.03492.03427.04306708.01.488089e+11NaN-48.0rb9999
142018-12-213500.03508.03535.03471.04313258.01.511073e+11NaN-8.0rb9999
152018-12-243515.03451.03516.03435.03542690.01.227598e+11NaN64.0rb9999
162018-12-253442.03398.03448.03357.04333400.01.473822e+11NaN44.0rb9999
172018-12-263393.03409.03430.03387.03285056.01.119878e+11NaN-16.0rb9999
182018-12-273418.03396.03471.03395.04172568.01.428099e+11NaN22.0rb9999
192018-12-283387.03404.03420.03382.02915936.09.916103e+10NaN-17.0rb9999
202019-01-023398.03382.03439.03371.02149936.07.286511e+10NaN16.0rb9999
212019-01-033378.03455.03456.03366.03719028.01.271959e+11NaN-77.0rb9999
222019-01-043460.03486.03493.03430.03323322.01.149595e+11NaN-26.0rb9999
232019-01-073490.03520.03526.03466.03053042.01.068218e+11NaN-30.0rb9999
242019-01-083521.03505.03523.03483.02740906.09.602977e+10NaN16.0rb9999
252019-01-093505.03507.03550.03494.03378598.01.190845e+11NaN-2.0rb9999
262019-01-103512.03514.03543.03502.03134668.01.104706e+11NaN-2.0rb9999
272019-01-113517.03539.03540.03492.03440962.01.210772e+11NaN-22.0rb9999
282019-01-143535.03575.03576.03520.02624250.09.324587e+10NaN-40.0rb9999
292019-01-153576.03519.03576.03508.03128980.01.105785e+11NaN57.0rb9999
302019-01-163519.03534.03545.03510.02632026.09.282553e+10NaN-15.0rb9999
312019-01-173529.03551.03565.03525.02258818.08.021438e+10NaN-22.0rb9999
322019-01-183552.03633.03636.03540.03599670.01.294739e+11NaN-81.0rb9999
332019-01-213636.03645.03675.03619.03526340.01.285795e+11NaN-9.0rb9999
342019-01-223650.03633.03700.03630.03970252.01.455658e+11NaN17.0rb9999
352019-01-233639.03644.03658.03625.02631994.09.588602e+10NaN-5.0rb9999
362019-01-243647.03680.03684.03639.03200450.01.171679e+11NaN-33.0rb9999
372019-01-253684.03710.03740.03675.03099552.01.150852e+11NaN-26.0rb9999
382019-01-283712.03681.03757.03680.03337710.01.238699e+11NaN31.0rb9999
392019-01-293680.03675.03694.03653.02508228.09.212779e+10NaN5.0rb9999
402019-01-303683.03677.03767.03667.04490378.01.671236e+11NaN6.0rb9999
412019-01-313682.03707.03718.03673.02724362.01.006953e+11NaN-25.0rb9999
422019-02-013696.03754.03770.03696.02506846.09.364989e+10NaN-58.0rb9999
432019-02-113850.03825.03908.03816.03480360.01.342309e+11NaN25.0rb9999
442019-02-123818.03785.03830.03771.03109428.01.180507e+11NaN33.0rb9999
452019-02-133780.03702.03796.03692.03814494.01.425991e+11NaN78.0rb9999
462019-02-143711.03684.03717.03668.03266348.01.205294e+11NaN27.0rb9999
472019-02-153680.03599.03696.03593.03916316.01.423635e+11NaN81.0rb9999
482019-02-183615.03659.03675.03611.03532344.01.287828e+11NaN-44.0rb9999
492019-02-193670.03655.03698.03647.02760590.01.013994e+11NaN15.0rb9999
502019-02-203650.03641.03669.03578.03918544.01.418150e+11NaN9.0rb9999
512019-02-213640.03677.03695.03635.03928972.01.442938e+11NaN-37.0rb9999
522019-02-223669.03731.03735.03663.03278144.01.215958e+11NaN-62.0rb9999
532019-02-253736.03682.03766.03681.03659076.01.363059e+11NaN54.0rb9999
542019-02-263677.03736.03745.03672.03998202.01.482914e+11NaN-59.0rb9999
552019-02-273737.03715.03750.03697.02948082.01.098979e+11NaN22.0rb9999
562019-02-283728.03750.03755.03705.03461158.01.292056e+11NaN-22.0rb9999
572019-03-013756.03815.03816.03751.02986416.01.127984e+11NaN-59.0rb9999

类似的,如果你使用 .fillna() 方法,Pandas 将对这个 DataFrame 里所有的空值位置填上你指定的默认值。比如,将表中所有 NaN 替换成 20 :

同理,.dropna() 和 .fillna() 并不会永久性改变你的数据,除非你传入了inplace=True 参数。

 
 
 
 
df['cross']= nan
df

.dataframe tbody tr th:only-of-type {        vertical-align: middle;    }    .dataframe tbody tr th {        vertical-align: top;    }    .dataframe thead th {        text-align: right;    }


openclosehighlowvolumemoneycmpcross
2018-12-033250.03333.03495.03208.05092070.01.696123e+11-83.0NaN
2018-12-043306.03374.03388.03283.04807696.01.600262e+11-68.0NaN
2018-12-053389.03463.03466.03362.05400890.01.840964e+11-74.0NaN
2018-12-063470.03375.03473.03371.04500922.01.535990e+1195.0NaN
2018-12-073360.03397.03426.03334.05127774.01.729243e+11-37.0NaN
2018-12-103388.03312.03398.03303.04741366.01.584575e+1176.0NaN
2018-12-113306.03327.03341.03284.03595984.01.189109e+11-21.0NaN
2018-12-123340.03317.03379.03317.04194776.01.404878e+1123.0NaN
2018-12-133325.03419.03431.03321.05327258.01.805737e+11-94.0NaN
2018-12-143416.03445.03450.03405.04124206.01.413454e+11-29.0NaN
2018-12-173448.03435.03470.03414.03915682.01.345986e+1113.0NaN
2018-12-183430.03435.03443.03391.03671214.01.255123e+11-5.0NaN
2018-12-193432.03425.03450.03411.03871012.01.328182e+117.0NaN
2018-12-203433.03481.03492.03427.04306708.01.488089e+11-48.0NaN
2018-12-213500.03508.03535.03471.04313258.01.511073e+11-8.0NaN
2018-12-243515.03451.03516.03435.03542690.01.227598e+1164.0NaN
2018-12-253442.03398.03448.03357.04333400.01.473822e+1144.0NaN
2018-12-263393.03409.03430.03387.03285056.01.119878e+11-16.0NaN
2018-12-273418.03396.03471.03395.04172568.01.428099e+1122.0NaN
2018-12-283387.03404.03420.03382.02915936.09.916103e+10-17.0NaN
2019-01-023398.03382.03439.03371.02149936.07.286511e+1016.0NaN
2019-01-033378.03455.03456.03366.03719028.01.271959e+11-77.0NaN
2019-01-043460.03486.03493.03430.03323322.01.149595e+11-26.0NaN
2019-01-073490.03520.03526.03466.03053042.01.068218e+11-30.0NaN
2019-01-083521.03505.03523.03483.02740906.09.602977e+1016.0NaN
2019-01-093505.03507.03550.03494.03378598.01.190845e+11-2.0NaN
2019-01-103512.03514.03543.03502.03134668.01.104706e+11-2.0NaN
2019-01-113517.03539.03540.03492.03440962.01.210772e+11-22.0NaN
2019-01-143535.03575.03576.03520.02624250.09.324587e+10-40.0NaN
2019-01-153576.03519.03576.03508.03128980.01.105785e+1157.0NaN
2019-01-163519.03534.03545.03510.02632026.09.282553e+10-15.0NaN
2019-01-173529.03551.03565.03525.02258818.08.021438e+10-22.0NaN
2019-01-183552.03633.03636.03540.03599670.01.294739e+11-81.0NaN
2019-01-213636.03645.03675.03619.03526340.01.285795e+11-9.0NaN
2019-01-223650.03633.03700.03630.03970252.01.455658e+1117.0NaN
2019-01-233639.03644.03658.03625.02631994.09.588602e+10-5.0NaN
2019-01-243647.03680.03684.03639.03200450.01.171679e+11-33.0NaN
2019-01-253684.03710.03740.03675.03099552.01.150852e+11-26.0NaN
2019-01-283712.03681.03757.03680.03337710.01.238699e+1131.0NaN
2019-01-293680.03675.03694.03653.02508228.09.212779e+105.0NaN
2019-01-303683.03677.03767.03667.04490378.01.671236e+116.0NaN
2019-01-313682.03707.03718.03673.02724362.01.006953e+11-25.0NaN
2019-02-013696.03754.03770.03696.02506846.09.364989e+10-58.0NaN
2019-02-113850.03825.03908.03816.03480360.01.342309e+1125.0NaN
2019-02-123818.03785.03830.03771.03109428.01.180507e+1133.0NaN
2019-02-133780.03702.03796.03692.03814494.01.425991e+1178.0NaN
2019-02-143711.03684.03717.03668.03266348.01.205294e+1127.0NaN
2019-02-153680.03599.03696.03593.03916316.01.423635e+1181.0NaN
2019-02-183615.03659.03675.03611.03532344.01.287828e+11-44.0NaN
2019-02-193670.03655.03698.03647.02760590.01.013994e+1115.0NaN
2019-02-203650.03641.03669.03578.03918544.01.418150e+119.0NaN
2019-02-213640.03677.03695.03635.03928972.01.442938e+11-37.0NaN
2019-02-223669.03731.03735.03663.03278144.01.215958e+11-62.0NaN
2019-02-253736.03682.03766.03681.03659076.01.363059e+1154.0NaN
2019-02-263677.03736.03745.03672.03998202.01.482914e+11-59.0NaN
2019-02-273737.03715.03750.03697.02948082.01.098979e+1122.0NaN
2019-02-283728.03750.03755.03705.03461158.01.292056e+11-22.0NaN
2019-03-013756.03815.03816.03751.02986416.01.127984e+11-59.0NaN
def cross(cmp):if cmp > 0:crs = Trueelse:crs = Falsereturn crs
cross(df['cmp'])
-ValueError                                Traceback (most recent call last)<ipython-input-26-7833c6976bc2> in <module>> 1 cross(df['cmp'])<ipython-input-25-70d306d78f78> in cross(cmp)      1 def cross(cmp):> 2     if cmp > 0:      3         crs = True      4     else:      5         crs = False/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in __nonzero__(self)   1574         raise ValueError("The truth value of a {0} is ambiguous. "   1575                          "Use a.empty, a.bool(), a.item(), a.any() or a.all()."-> 1576                          .format(self.__class__.__name__))   1577    1578     __bool__ = __nonzero__ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
ts =pd.Series(np.random.randn(1000), index=pd.date_range("2001-1-1",periods = 1000))
df4 = pd.DataFrame(np.random.randn(1000,4),index=ts.index,columns = ['a','b','c','d'])
df5 =df4.cumsum()
df5.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f774e988400>
df5.dtypes
a    float64
b    float64
c    float64
d    float64
dtype: object
df.dtypes
index     datetime64[ns]
open             float64
close            float64
high             float64
low              float64
volume           float64
money            float64
o-c              float64
o-c2             float64
code              object
dtype: object
 

全部回复

0/140

量化课程

    移动端课程