Problem description
I have a MEMS IMU on which I've been collecting data and I'm using pandas to get some statistical data from it. There are 6 32-bit floats collected each cycle. Data rates are fixed for a given collection run. The data rates vary between 100Hz and 1000Hz and the collection times run as long as 72 hours. The data is saved in a flat binary file. I read the data this way:
import numpy as np
import pandas as pd
dataType=np.dtype([('a','<f4'),('b','<f4'),('c','<f4'),('d','<f4'),('e','<f4'),('f','<f4')])
df=pd.DataFrame(np.fromfile('FILENAME',dataType))
df['c'].mean()
-9.880581855773926
x=df['c'].values
x.mean()
-9.8332081
-9.833 is the correct result. I can create a similar result that anyone should be able to reproduce this way:
import numpy as np
import pandas as pd
x=np.random.normal(-9.8,.05,size=900000)
df=pd.DataFrame(x,dtype='float32',columns=['x'])
df['x'].mean()
-9.859579086303711
x.mean()
-9.8000648778888628
I've repeated this on linux and windows, on AMD and Intel processors, in Python 2.7 and 3.5. I'm stumped. What am I doing wrong? And get this:
x=np.random.normal(-9.,.005,size=900000)
df=pd.DataFrame(x,dtype='float32',columns=['x'])
df['x'].mean()
-8.999998092651367
x.mean()
-9.0000075889406528
I could accept this difference. It's at the limit of the precision of 32 bit floats.
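For reference, one way to see what that limit looks like numerically is to check float32's machine epsilon and the spacing of representable values at this magnitude:
import numpy as np

# Relative precision of a single float32 value (machine epsilon)
print(np.finfo(np.float32).eps)       # ~1.19e-07

# Absolute gap between adjacent representable float32 values near 9.8
print(np.spacing(np.float32(9.8)))    # ~9.54e-07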
NEVERMIND. I wrote this on Friday and the solution hit me this morning. It is a floating point precision problem exacerbated by the large amount of data. I needed to convert the data into 64 bit float on the creation of the dataframe this way:
df=pd.DataFrame(np.fromfile('FILENAME',dataType),dtype='float64')
I'll leave the post up in case anyone else runs into a similar issue.
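In case it helps, here is a sketch of both variants of that fix: widening to float64 when the DataFrame is built, or casting just before the reduction. FILENAME and the field names are the same placeholders as above:
import numpy as np
import pandas as pd

dataType = np.dtype([('a', '<f4'), ('b', '<f4'), ('c', '<f4'),
                     ('d', '<f4'), ('e', '<f4'), ('f', '<f4')])
raw = np.fromfile('FILENAME', dataType)

# Option 1: widen everything to float64 when building the DataFrame
df64 = pd.DataFrame(raw, dtype='float64')
print(df64['c'].mean())

# Option 2: keep the float32 DataFrame, but cast just before the reduction
df32 = pd.DataFrame(raw)
print(df32['c'].astype('float64').mean())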
Recommended answer
Short version:
The reason it's different is because pandas uses bottleneck (if it's installed) when calling the mean operation, as opposed to just relying on numpy. bottleneck is presumably used since it appears to be faster than numpy (at least on my machine), but at the cost of precision. They happen to match for the 64 bit version, but differ in 32 bit land (which is the interesting part).
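If you want to confirm whether bottleneck is actually in play on your install, here is a quick diagnostic sketch; note that _USE_BOTTLENECK is a private flag (it shows up in the decorator later), so its location can vary between pandas versions:
import pandas as pd

try:
    import bottleneck as bn
    print("bottleneck installed:", bn.__version__)
except ImportError:
    print("bottleneck not installed")

# Private flag referenced in the decorator below; location may vary by version
from pandas.core import nanops
print("pandas will use bottleneck:", nanops._USE_BOTTLENECK)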
It's extremely difficult to tell what's going on just by inspecting the source code of these modules (they're quite complex, even for simple computations like mean; it turns out numerical computing is hard). Best to use the debugger to avoid brain-compiling and those types of mistakes. The debugger won't make a mistake in logic; it'll tell you exactly what's going on.
Here's some of my stack trace (values differ slightly since no seed for RNG):
Can reproduce (Windows):
>>> import numpy as np; import pandas as pd
>>> x=np.random.normal(-9.,.005,size=900000)
>>> df=pd.DataFrame(x,dtype='float32',columns=['x'])
>>> df['x'].mean()
-9.0
>>> x.mean()
-9.0000037501099754
>>> x.astype(np.float32).mean()
-9.0000029
Nothing extraordinary going on with numpy's version. It's the pandas version that's a little wacky.
Let's have a look at df['x'].mean():
>>> def test_it_2():
... import pdb; pdb.set_trace()
... df['x'].mean()
>>> test_it_2()
... # Some stepping/poking around that isn't important
(Pdb) l
2307
2308 if we have an ndarray as a value, then simply perform the operation,
2309 otherwise delegate to the object
2310
2311 """
2312 -> delegate = self._values
2313 if isinstance(delegate, np.ndarray):
2314 # Validate that 'axis' is consistent with Series's single axis.
2315 self._get_axis_number(axis)
2316 if numeric_only:
2317 raise NotImplementedError('Series.{0} does not implement '
(Pdb) delegate.dtype
dtype('float32')
(Pdb) l
2315 self._get_axis_number(axis)
2316 if numeric_only:
2317 raise NotImplementedError('Series.{0} does not implement '
2318 'numeric_only.'.format(name))
2319 with np.errstate(all='ignore'):
2320 -> return op(delegate, skipna=skipna, **kwds)
2321
2322 return delegate._reduce(op=op, name=name, axis=axis, skipna=skipna,
2323 numeric_only=numeric_only,
2324 filter_type=filter_type, **kwds)
So we found the trouble spot, but now things get kind of weird:
(Pdb) op
<function nanmean at 0x000002CD8ACD4488>
(Pdb) op(delegate)
-9.0
(Pdb) delegate_64 = delegate.astype(np.float64)
(Pdb) op(delegate_64)
-9.000003749978807
(Pdb) delegate.mean()
-9.0000029
(Pdb) delegate_64.mean()
-9.0000037499788075
(Pdb) np.nanmean(delegate, dtype=np.float64)
-9.0000037499788075
(Pdb) np.nanmean(delegate, dtype=np.float32)
-9.0000029
Note that delegate.mean() and np.nanmean output -9.0000029 with type float32, not -9.0 as pandas' nanmean does. With a bit of poking around, you can find the source to pandas' nanmean in pandas.core.nanops. Interestingly, it actually appears like it should be matching numpy at first. Let's have a look at pandas' nanmean:
(Pdb) import inspect
(Pdb) src = inspect.getsource(op).split("\n")
(Pdb) for line in src: print(line)
@disallow('M8')
@bottleneck_switch()
def nanmean(values, axis=None, skipna=True):
    values, mask, dtype, dtype_max = _get_values(values, skipna, 0)

    dtype_sum = dtype_max
    dtype_count = np.float64
    if is_integer_dtype(dtype) or is_timedelta64_dtype(dtype):
        dtype_sum = np.float64
    elif is_float_dtype(dtype):
        dtype_sum = dtype
        dtype_count = dtype
    count = _get_counts(mask, axis, dtype=dtype_count)
    the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))

    if axis is not None and getattr(the_sum, 'ndim', False):
        the_mean = the_sum / count
        ct_mask = count == 0
        if ct_mask.any():
            the_mean[ct_mask] = np.nan
    else:
        the_mean = the_sum / count if count > 0 else np.nan

    return _wrap_results(the_mean, dtype)
Here's a (short) version of the bottleneck_switch decorator:
import bottleneck as bn
...

class bottleneck_switch(object):

    def __init__(self, **kwargs):
        self.kwargs = kwargs

    def __call__(self, alt):
        bn_name = alt.__name__

        try:
            bn_func = getattr(bn, bn_name)
        except (AttributeError, NameError):  # pragma: no cover
            bn_func = None

        ...

            if (_USE_BOTTLENECK and skipna and
                    _bn_ok_dtype(values.dtype, bn_name)):
                result = bn_func(values, axis=axis, **kwds)
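If the real decorator is hard to follow, here is a stripped-down, hypothetical sketch of the same dispatch idea (not pandas' actual code): look up a function of the same name in bottleneck and fall back to the plain implementation when it's missing or the dtype isn't suitable:
import numpy as np

try:
    import bottleneck as bn
except ImportError:
    bn = None


def fast_or_fallback(alt):
    """Wrap alt so the same-named bottleneck function is used when available."""
    bn_func = getattr(bn, alt.__name__, None) if bn is not None else None

    def wrapper(values, **kwds):
        # Hypothetical guard; pandas' real _bn_ok_dtype check is more involved.
        if bn_func is not None and values.dtype.kind == 'f':
            return bn_func(values, **kwds)
        return alt(values, **kwds)

    return wrapper


@fast_or_fallback
def nanmean(values, **kwds):
    return np.nanmean(values, **kwds)


x32 = np.random.normal(-9.0, 0.005, size=900000).astype(np.float32)
print(nanmean(x32))                      # bottleneck path if it's installed
print(nanmean(x32.astype(np.float64)))   # float64: both paths agree closely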
The real decorator is called with alt as the pandas nanmean function, so bn_name is 'nanmean', and this is the attr that's grabbed from the bottleneck module:
(Pdb) l
93 result = np.empty(result_shape)
94 result.fill(0)
95 return result
96
97 if (_USE_BOTTLENECK and skipna and
98 -> _bn_ok_dtype(values.dtype, bn_name)):
99 result = bn_func(values, axis=axis, **kwds)
100
101 # prefer to treat inf/-inf as NA, but must compute the fun
102 # twice :(
103 if _has_infs(result):
(Pdb) n
> d:\anaconda3\lib\site-packages\pandas\core\nanops.py(99)f()
-> result = bn_func(values, axis=axis, **kwds)
(Pdb) alt
<function nanmean at 0x000001D2C8C04378>
(Pdb) alt.__name__
'nanmean'
(Pdb) bn_func
<built-in function nanmean>
(Pdb) bn_name
'nanmean'
(Pdb) bn_func(values, axis=axis, **kwds)
-9.0
Pretend for a second that the bottleneck_switch() decorator doesn't exist. We can actually see that manually stepping through this function (without bottleneck) will get you the same result as numpy:
(Pdb) from pandas.core.nanops import _get_counts
(Pdb) from pandas.core.nanops import _get_values
(Pdb) from pandas.core.nanops import _ensure_numeric
(Pdb) values, mask, dtype, dtype_max = _get_values(delegate, skipna=skipna)
(Pdb) count = _get_counts(mask, axis=None, dtype=dtype)
(Pdb) count
900000.0
(Pdb) values.sum(axis=None, dtype=dtype) / count
-9.0000029
That never gets called, though, if you have bottleneck installed. Instead, the bottleneck_switch() decorator blasts over the nanmean function with bottleneck's version. This is where the discrepancy lies (interestingly, it matches on the float64 case, though):
(Pdb) import bottleneck as bn
(Pdb) bn.nanmean(delegate)
-9.0
(Pdb) bn.nanmean(delegate.astype(np.float64))
-9.000003749978807
bottleneck is used solely for speed, as far as I can tell. I'm assuming they're taking some type of shortcut with their nanmean function, but I didn't look into it much (see @ead's answer for details on this topic). You can see that it's typically a bit faster than numpy in their benchmarks: https://github.com/kwgoodman/bottleneck. Clearly, the price to pay for this speed is precision.
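To get a feel for the kind of precision a simpler summation strategy gives up in float32, here is a rough sketch (not bottleneck's actual algorithm): numpy's reduction uses pairwise summation, while a strict left-to-right accumulation, approximated below with cumsum, drifts much further from the float64 reference:
import numpy as np

x32 = np.random.normal(-9.8, .05, size=900000).astype(np.float32)

# numpy's sum/mean uses pairwise summation, which keeps rounding error small
pairwise_mean = x32.mean()

# cumsum accumulates strictly left to right in float32, mimicking a naive loop
naive_mean = x32.cumsum(dtype=np.float32)[-1] / np.float32(x32.size)

# float64 accumulation as a reference value
reference_mean = x32.mean(dtype=np.float64)

print(pairwise_mean, naive_mean, reference_mean)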
Is bottleneck really faster?
Sure looks like it (at least on my machine).
In [1]: import numpy as np; import pandas as pd
In [2]: x=np.random.normal(-9.8,.05,size=900000)
In [3]: y_32 = x.astype(np.float32)
In [13]: %timeit np.nanmean(y_32)
100 loops, best of 3: 5.72 ms per loop
In [14]: %timeit bn.nanmean(y_32)
1000 loops, best of 3: 854 μs per loop
It might be nice for pandas to introduce a flag here (one for speed, the other for better precision; the default is speed since that's the current impl). Some users care much more about the accuracy of the computation than the speed at which it happens.
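For what it's worth, newer pandas versions do expose an option along these lines; if your version has it, you can switch the computation back to numpy (a hedged sketch, since the option's availability and default depend on the pandas version):
import numpy as np
import pandas as pd

x = np.random.normal(-9.8, .05, size=900000)
df = pd.DataFrame(x, dtype='float32', columns=['x'])

# Available in pandas >= 0.20; disabling falls back to numpy's summation
pd.set_option('compute.use_bottleneck', False)
print(df['x'].mean())

pd.set_option('compute.use_bottleneck', True)
print(df['x'].mean())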
HTH.