Problem description
I have a MEMS IMU on which I've been collecting data and I'm using pandas to get some statistical data from it. There are 6 32-bit floats collected each cycle. Data rates are fixed for a given collection run. The data rates vary between 100Hz and 1000Hz and the collection times run as long as 72 hours. The data is saved in a flat binary file. I read the data this way:
import numpy as np
import pandas as pd
dataType=np.dtype([('a','<f4'),('b','<f4'),('c','<f4'),('d','<f4'),('e','<f4'),('f','<f4')])
df=pd.DataFrame(np.fromfile('FILENAME',dataType))
df['c'].mean()
-9.880581855773926
x=df['c'].values
x.mean()
-9.8332081
-9.833 is the correct result. I can create a similar result that anyone should be able to reproduce this way:
import numpy as np
import pandas as pd
x=np.random.normal(-9.8,.05,size=900000)
df=pd.DataFrame(x,dtype='float32',columns=['x'])
df['x'].mean()
-9.859579086303711
x.mean()
-9.8000648778888628
I've repeated this on linux and windows, on AMD and Intel processors, in Python 2.7 and 3.5. I'm stumped. What am I doing wrong? And get this:
x=np.random.normal(-9.,.005,size=900000)
df=pd.DataFrame(x,dtype='float32',columns=['x'])
df['x'].mean()
-8.999998092651367
x.mean()
-9.0000075889406528
I could accept this difference. It's at the limit of the precision of 32 bit floats.
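For reference, one way to see what that limit looks like numerically is to check float32's machine epsilon and the spacing of representable values at this magnitude:
import numpy as np

# Relative precision of a single float32 value (machine epsilon)
print(np.finfo(np.float32).eps)       # ~1.19e-07

# Absolute gap between adjacent representable float32 values near 9.8
print(np.spacing(np.float32(9.8)))    # ~9.54e-07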
NEVERMIND. I wrote this on Friday and the solution hit me this morning. It is a floating point precision problem exacerbated by the large amount of data. I needed to convert the data into 64 bit float on the creation of the dataframe this way:
df=pd.DataFrame(np.fromfile('FILENAME',dataType),dtype='float64')
I'll leave the post up in case anyone else runs into a similar issue.
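In case it helps, here is a sketch of both variants of that fix: widening to float64 when the DataFrame is built, or casting just before the reduction. FILENAME and the field names are the same placeholders as above:
import numpy as np
import pandas as pd

dataType = np.dtype([('a', '<f4'), ('b', '<f4'), ('c', '<f4'),
                     ('d', '<f4'), ('e', '<f4'), ('f', '<f4')])
raw = np.fromfile('FILENAME', dataType)

# Option 1: widen everything to float64 when building the DataFrame
df64 = pd.DataFrame(raw, dtype='float64')
print(df64['c'].mean())

# Option 2: keep the float32 DataFrame, but cast just before the reduction
df32 = pd.DataFrame(raw)
print(df32['c'].astype('float64').mean())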
Recommended answer
Short version:
The reason it's different is because pandas uses bottleneck (if it's installed) when calling the mean operation, as opposed to just relying on numpy. bottleneck is presumably used since it appears to be faster than numpy (at least on my machine), but at the cost of precision. They happen to match for the 64 bit version, but differ in 32 bit land (which is the interesting part).
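If you want to confirm whether bottleneck is actually in play on your install, here is a quick diagnostic sketch; note that _USE_BOTTLENECK is a private flag (it shows up in the decorator later), so its location can vary between pandas versions:
import pandas as pd

try:
    import bottleneck as bn
    print("bottleneck installed:", bn.__version__)
except ImportError:
    print("bottleneck not installed")

# Private flag referenced in the decorator below; location may vary by version
from pandas.core import nanops
print("pandas will use bottleneck:", nanops._USE_BOTTLENECK)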
It's extremely difficult to tell what's going on just by inspecting the source code of these modules (they're quite complex, even for simple computations like mean; it turns out numerical computing is hard). Best to use the debugger to avoid brain-compiling and those types of mistakes. The debugger won't make a mistake in logic; it'll tell you exactly what's going on.
Here's some of my stack trace (values differ slightly since no seed for RNG):
Can reproduce (Windows):
>>> import numpy as np; import pandas as pd
>>> x=np.random.normal(-9.,.005,size=900000)
>>> df=pd.DataFrame(x,dtype='float32',columns=['x'])
>>> df['x'].mean()
-9.0
>>> x.mean()
-9.0000037501099754
>>> x.astype(np.float32).mean()
-9.0000029
Nothing extraordinary going on with numpy's version. It's the pandas version that's a little wacky.
Let's have a look at df['x'].mean():
>>> def test_it_2():
... import pdb; pdb.set_trace()
... df['x'].mean()
>>> test_it_2()
... # Some stepping/poking around that isn't important
(Pdb) l
2307
2308 if we have an ndarray as a value, then simply perform the operation,
2309 otherwise delegate to the object
2310
2311 """
2312 -> delegate = self._values
2313 if isinstance(delegate, np.ndarray):
2314 # Validate that 'axis' is consistent with Series's single axis.
2315 self._get_axis_number(axis)
2316 if numeric_only:
2317 raise NotImplementedError('Series.{0} does not implement '
(Pdb) delegate.dtype
dtype('float32')
(Pdb) l
2315 self._get_axis_number(axis)
2316 if numeric_only:
2317 raise NotImplementedError('Series.{0} does not implement '
2318 'numeric_only.'.format(name))
2319 with np.errstate(all='ignore'):
2320 -> return op(delegate, skipna=skipna, **kwds)
2321
2322 return delegate._reduce(op=op, name=name, axis=axis, skipna=skipna,
2323 numeric_only=numeric_only,
2324 filter_type=filter_type, **kwds)
So we found the trouble spot, but now things get kind of weird:
(Pdb) op
<function nanmean at 0x000002CD8ACD4488>
(Pdb) op(delegate)
-9.0
(Pdb) delegate_64 = delegate.astype(np.float64)
(Pdb) op(delegate_64)
-9.000003749978807
(Pdb) delegate.mean()
-9.0000029
(Pdb) delegate_64.mean()
-9.0000037499788075
(Pdb) np.nanmean(delegate, dtype=np.float64)
-9.0000037499788075
(Pdb) np.nanmean(delegate, dtype=np.float32)
-9.0000029
Note that delegate.mean() and np.nanmean output -9.0000029 with type float32, not -9.0 as pandas' nanmean does. With a bit of poking around, you can find the source to pandas' nanmean in pandas.core.nanops. Interestingly, it actually appears like it should be matching numpy at first. Let's have a look at pandas' nanmean:
(Pdb) import inspect
(Pdb) src = inspect.getsource(op).split("\n")
(Pdb) for line in src: print(line)
@disallow('M8')
@bottleneck_switch()
def nanmean(values, axis=None, skipna=True):
    values, mask, dtype, dtype_max = _get_values(values, skipna, 0)

    dtype_sum = dtype_max
    dtype_count = np.float64
    if is_integer_dtype(dtype) or is_timedelta64_dtype(dtype):
        dtype_sum = np.float64
    elif is_float_dtype(dtype):
        dtype_sum = dtype
        dtype_count = dtype
    count = _get_counts(mask, axis, dtype=dtype_count)
    the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))

    if axis is not None and getattr(the_sum, 'ndim', False):
        the_mean = the_sum / count
        ct_mask = count == 0
        if ct_mask.any():
            the_mean[ct_mask] = np.nan
    else:
        the_mean = the_sum / count if count > 0 else np.nan

    return _wrap_results(the_mean, dtype)
Here's a (short) version of the bottleneck_switch decorator:
import bottleneck as bn
...

class bottleneck_switch(object):

    def __init__(self, **kwargs):
        self.kwargs = kwargs

    def __call__(self, alt):
        bn_name = alt.__name__

        try:
            bn_func = getattr(bn, bn_name)
        except (AttributeError, NameError):  # pragma: no cover
            bn_func = None

        ...

            if (_USE_BOTTLENECK and skipna and
                    _bn_ok_dtype(values.dtype, bn_name)):
                result = bn_func(values, axis=axis, **kwds)
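If the real decorator is hard to follow, here is a stripped-down, hypothetical sketch of the same dispatch idea (not pandas' actual code): look up a function of the same name in bottleneck and fall back to the plain implementation when it's missing or the dtype isn't suitable:
import numpy as np

try:
    import bottleneck as bn
except ImportError:
    bn = None


def fast_or_fallback(alt):
    """Wrap alt so the same-named bottleneck function is used when available."""
    bn_func = getattr(bn, alt.__name__, None) if bn is not None else None

    def wrapper(values, **kwds):
        # Hypothetical guard; pandas' real _bn_ok_dtype check is more involved.
        if bn_func is not None and values.dtype.kind == 'f':
            return bn_func(values, **kwds)
        return alt(values, **kwds)

    return wrapper


@fast_or_fallback
def nanmean(values, **kwds):
    return np.nanmean(values, **kwds)


x32 = np.random.normal(-9.0, 0.005, size=900000).astype(np.float32)
print(nanmean(x32))                      # bottleneck path if it's installed
print(nanmean(x32.astype(np.float64)))   # float64: both paths agree closely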
The real decorator is called with alt as the pandas nanmean function, so bn_name is 'nanmean', and this is the attr that's grabbed from the bottleneck module:
(Pdb) l
93 result = np.empty(result_shape)
94 result.fill(0)
95 return result
96
97 if (_USE_BOTTLENECK and skipna and
98 -> _bn_ok_dtype(values.dtype, bn_name)):
99 result = bn_func(values, axis=axis, **kwds)
100
101 # prefer to treat inf/-inf as NA, but must compute the fun
102 # twice :(
103 if _has_infs(result):
(Pdb) n
> d:\anaconda3\lib\site-packages\pandas\core\nanops.py(99)f()
-> result = bn_func(values, axis=axis, **kwds)
(Pdb) alt
<function nanmean at 0x000001D2C8C04378>
(Pdb) alt.__name__
'nanmean'
(Pdb) bn_func
<built-in function nanmean>
(Pdb) bn_name
'nanmean'
(Pdb) bn_func(values, axis=axis, **kwds)
-9.0
Pretend for a second that the bottleneck_switch() decorator doesn't exist. We can actually see that manually stepping through this function (without bottleneck) will get you the same result as numpy:
(Pdb) from pandas.core.nanops import _get_counts
(Pdb) from pandas.core.nanops import _get_values
(Pdb) from pandas.core.nanops import _ensure_numeric
(Pdb) values, mask, dtype, dtype_max = _get_values(delegate, skipna=skipna)
(Pdb) count = _get_counts(mask, axis=None, dtype=dtype)
(Pdb) count
900000.0
(Pdb) values.sum(axis=None, dtype=dtype) / count
-9.0000029
That never gets called, though, if you have bottleneck installed. Instead, the bottleneck_switch() decorator blasts over the nanmean function with bottleneck's version. This is where the discrepancy lies (interestingly, it matches on the float64 case, though):
(Pdb) import bottleneck as bn
(Pdb) bn.nanmean(delegate)
-9.0
(Pdb) bn.nanmean(delegate.astype(np.float64))
-9.000003749978807
bottleneck is used solely for speed, as far as I can tell. I'm assuming they're taking some type of shortcut with their nanmean function, but I didn't look into it much (see @ead's answer for details on this topic). You can see that it's typically a bit faster than numpy in their benchmarks: https://github.com/kwgoodman/bottleneck. Clearly, the price to pay for this speed is precision.
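To get a feel for the kind of precision a simpler summation strategy gives up in float32, here is a rough sketch (not bottleneck's actual algorithm): numpy's reduction uses pairwise summation, while a strict left-to-right accumulation, approximated below with cumsum, drifts much further from the float64 reference:
import numpy as np

x32 = np.random.normal(-9.8, .05, size=900000).astype(np.float32)

# numpy's sum/mean uses pairwise summation, which keeps rounding error small
pairwise_mean = x32.mean()

# cumsum accumulates strictly left to right in float32, mimicking a naive loop
naive_mean = x32.cumsum(dtype=np.float32)[-1] / np.float32(x32.size)

# float64 accumulation as a reference value
reference_mean = x32.mean(dtype=np.float64)

print(pairwise_mean, naive_mean, reference_mean)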
Is bottleneck really faster?
Sure looks like it (at least on my machine).
In [1]: import numpy as np; import pandas as pd
In [2]: x=np.random.normal(-9.8,.05,size=900000)
In [3]: y_32 = x.astype(np.float32)
In [13]: %timeit np.nanmean(y_32)
100 loops, best of 3: 5.72 ms per loop
In [14]: %timeit bn.nanmean(y_32)
1000 loops, best of 3: 854 μs per loop
It might be nice for pandas to introduce a flag here (one for speed, the other for better precision; the default is speed since that's the current impl). Some users care much more about the accuracy of the computation than the speed at which it happens.
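For what it's worth, newer pandas versions do expose an option along these lines; if your version has it, you can switch the computation back to numpy (a hedged sketch, since the option's availability and default depend on the pandas version):
import numpy as np
import pandas as pd

x = np.random.normal(-9.8, .05, size=900000)
df = pd.DataFrame(x, dtype='float32', columns=['x'])

# Available in pandas >= 0.20; disabling falls back to numpy's summation
pd.set_option('compute.use_bottleneck', False)
print(df['x'].mean())

pd.set_option('compute.use_bottleneck', True)
print(df['x'].mean())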
HTH.