First, import the libraries we need and initialize a DataFrame for testing
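import functools
import time

import pandas as pd

def timmer(func):
    """Print how long one call to func takes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        # Forward the arguments and keep the return value
        result = func(*args, **kwargs)
        stop_time = time.time()
        print("the func run time is %s" % (stop_time - start_time))
        return result
    return wrapper

def add(num):
    return num + 2

# Build the frame row-wise, then transpose so 'date' and 'value' become columns
df = pd.DataFrame(
    columns=['a', 'b', 'c', 'd', 'e', 'f'],
    index=['date', 'value'],
    data=[
        ['1/1/13 0:00', '1/1/13 1:00', '1/1/13 2:00', '1/1/13 3:00', '1/1/13 4:00', '1/1/13 5:00'],
        [0.1, 0.2, 3, 4, 5, 0.6],
    ],
).T
df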
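The timing decorator above can also be written with `time.perf_counter`, which is generally better suited to measuring short intervals than `time.time`. This is only a minimal sketch: the names `timed` and `add_two` are hypothetical and do not appear in the original notebook.

```python
import functools
import time


def timed(func):
    """Hypothetical variant of the timing decorator that also returns the result."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()      # monotonic, high-resolution clock
        result = func(*args, **kwargs)   # forward arguments to the wrapped function
        elapsed = time.perf_counter() - start
        print("%s run time is %.6f s" % (func.__name__, elapsed))
        return result
    return wrapper


@timed
def add_two(num):
    # Hypothetical stand-in for the notebook's add()
    return num + 2
```

Calling `add_two(1)` prints the elapsed time and still returns the computed value, so the decorated function can be used in ordinary expressions.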
Check which pandas version we are using
pd.__version__
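First I tested how long it takes to convert strings to datetime. As the article says, passing format does help and the speedup is noticeable, but it is not as large as the article describes. I suspect this comes from differences between pandas versions, so it is still worth verifying things yourself.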
@timmer
def test_1():
    # Reset the column to strings each round, then parse without a format
    for _ in range(1000):
        df['date'] = df['date'].apply(lambda x: '1/1/13 1:00')
        df['date'] = pd.to_datetime(df['date'])

test_1()

@timmer
def test_2():
    # Same as test_1, but with an explicit format string
    for _ in range(1000):
        df['date'] = df['date'].apply(lambda x: '1/1/13 1:00')
        df['date'] = pd.to_datetime(df['date'], format='%d/%m/%y %H:%M')

test_2()
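The comparison above can also be run on a larger series with `timeit`, which makes the effect of `format` easier to see. The sketch below assumes 10,000 identical strings as test data (the size is an arbitrary choice, not from the notebook) and prints timings without claiming any particular ratio. It also shows that `'%d/%m/%y'` reads the first field as the day, which only matters once day and month differ.

```python
import timeit

import pandas as pd

# Assumed test data: 10,000 strings in the notebook's date layout
dates = pd.Series(['1/1/13 1:00'] * 10_000)
fmt = '%d/%m/%y %H:%M'

inferred = pd.to_datetime(dates)              # pandas works out the layout itself
explicit = pd.to_datetime(dates, format=fmt)  # layout given up front

t_inferred = timeit.timeit(lambda: pd.to_datetime(dates), number=5)
t_explicit = timeit.timeit(lambda: pd.to_datetime(dates, format=fmt), number=5)
print("no format: %.4f s, with format: %.4f s" % (t_inferred, t_explicit))

# With different day and month fields the two calls disagree:
# the default parser is month-first, while fmt is day-first.
month_first = pd.to_datetime('2/1/13 0:00')
day_first = pd.to_datetime('2/1/13 0:00', format=fmt)
```

If the source data is actually month-first, the format string would need to be `'%m/%d/%y %H:%M'` instead; for the all-ones dates used in the tests the result is the same either way.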
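Next, test the speed of modifying data. Here we use 6 different methods:
The results show that vectorization (test_8) and .apply rank at the top and are the best choices, itertuples and iterrows form the second tier, and loc is the slowest, in third place.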
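@timmer
def test_3():
    # Row-by-row read and write through .loc
    for _ in range(1000):
        for idx in df.index:
            df.loc[idx, 'value'] = df.loc[idx, 'value'] + 0.002

@timmer
def test_4():
    # iterrows may yield copies, so this write is not guaranteed to reach df;
    # the test mainly measures iteration cost
    for _ in range(1000):
        for index, row in df.iterrows():
            row['value'] = row['value'] + 0.002

@timmer
def test_5():
    # itertuples yields tuples; the sum is computed but never written back to df
    for _ in range(1000):
        for index, date, value in df.itertuples():
            value = value + 0.002

@timmer
def test_6():
    # .apply with a named function (note: add() adds 2, not 0.002)
    for _ in range(1000):
        df['value'] = df['value'].apply(add)

@timmer
def test_7():
    # .apply with a lambda (also adds 2, not 0.002)
    for _ in range(1000):
        df['value'] = df['value'].apply(lambda x: x + 2)

@timmer
def test_8():
    # Vectorized in-place addition on the whole column
    for _ in range(1000):
        df['value'] += 0.002

test_3()
test_4()
test_5()
test_6()
test_7()
test_8()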
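One detail that may affect the numbers above: because the frame is built from a row of strings and a row of numbers and then transposed, the `value` column ends up with `object` dtype rather than `float64`. The sketch below rebuilds the same frame, converts the column explicitly, and times the vectorized addition on both representations. It is only an illustration under that assumption and does not claim any specific speedup.

```python
import timeit

import pandas as pd

# Same construction as the notebook: rows of mixed types, then transpose
df = pd.DataFrame(
    columns=['a', 'b', 'c', 'd', 'e', 'f'],
    index=['date', 'value'],
    data=[
        ['1/1/13 0:00', '1/1/13 1:00', '1/1/13 2:00',
         '1/1/13 3:00', '1/1/13 4:00', '1/1/13 5:00'],
        [0.1, 0.2, 3, 4, 5, 0.6],
    ],
).T

object_dtype = df['value'].dtype      # object, a side effect of the transpose
numeric = df['value'].astype(float)   # explicit conversion to float64

# Time the same vectorized addition on both representations
t_object = timeit.timeit(lambda: df['value'] + 0.002, number=1000)
t_float = timeit.timeit(lambda: numeric + 0.002, number=1000)
print("object dtype: %.4f s, float64: %.4f s" % (t_object, t_float))

shifted = numeric + 0.002
```

With only six rows the difference is likely to be small; a larger frame would be needed to compare the methods in a way that reflects real workloads.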