Problem description
I have a table of time-series data that is mostly NULLs, and I want to fill in every NULL with the last known value.
I have a few solutions, but they are much slower than the equivalent DataFrame.fillna(method='ffill') operation in Pandas.
A simplified version of the code and data I'm using:
select d.[date], d.[price],
    -- most recent non-null price on or before the current row's date
    (select top 1 p.price from price_table p
     where p.price is not null and p.[date] <= d.[date]
     order by p.[date] desc) as ff_price
from price_table d
which produces this table:
date price ff_price
---------- ----- --------
2016-07-11 0.79 0.79
2016-07-12 NULL 0.79
2016-07-13 NULL 0.79
2016-07-14 0.69 0.69
2016-07-15 NULL 0.69
...
2016-09-21 0.88 0.88
...
I have more than 100 million rows, so this takes quite a while.
Recommended answer
Assuming that your [date] column is DATE and your price column is DECIMAL(5,2), please test this approach:
SELECT
    P.[date],
    P.[price],
    ff_price = CONVERT(
        DECIMAL(5,2), -- Original price datatype
        SUBSTRING(
            MAX(
                CAST(P.[date] AS BINARY(3)) +  -- 3: datalength of P.[date] column
                CAST(P.[price] AS BINARY(5))   -- 5: datalength of P.[price] column
            ) OVER (ORDER BY P.[date] ROWS UNBOUNDED PRECEDING),
            4,   -- Start position just past the binary part of the date
            5))  -- Number of bytes that make up the binary of the original price datatype
FROM
    price_table AS P
This is a solution I implemented for a similar problem, and you can find the exhaustive explanation here. The approach works well because it does not require an explicit sort, as long as you have an index on the date column.
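A minimal sketch of the kind of index this relies on. The index name IX_price_table_date is hypothetical, and the choice of a nonclustered index that includes the price is an assumption about the table, not something stated in the answer:

```sql
-- Hypothetical index name; assumes price_table has [date] and [price] columns.
-- Keyed on [date] so the window's ORDER BY can read rows already in order;
-- INCLUDE ([price]) lets the forward-fill query be served from this index alone.
CREATE NONCLUSTERED INDEX IX_price_table_date
    ON price_table ([date])
    INCLUDE ([price]);
```

If the table is already clustered on [date], an extra index like this is probably unnecessary.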
What it does is use a windowed MAX over the concatenation of the 3 bytes that make up your date column (this is why I mentioned that your column must be DATE; a DATETIME needs 8 bytes, and you can edit the query to work with that) and the bytes that make up your price column (5 bytes, also assumed). This is the CAST(P.[date] AS BINARY(3)) + CAST(P.[price] AS BINARY(5)) part.
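As a sketch of the DATETIME edit mentioned above, assume a hypothetical table price_table_dt whose [ts] column is DATETIME and whose price is still DECIMAL(5,2). The key widens to 8 bytes, so the price now starts at byte 9:

```sql
-- Hypothetical table: price_table_dt ([ts] DATETIME, [price] DECIMAL(5,2)).
SELECT
    P.[ts],
    P.[price],
    ff_price = CONVERT(
        DECIMAL(5,2),
        SUBSTRING(
            MAX(
                CAST(P.[ts] AS BINARY(8)) +      -- 8 bytes for DATETIME
                CAST(P.[price] AS BINARY(5))     -- 5 bytes for the price, as before
            ) OVER (ORDER BY P.[ts] ROWS UNBOUNDED PRECEDING),
            9,                                   -- skip the 8 DATETIME bytes
            5))                                  -- read the 5 price bytes
FROM
    price_table_dt AS P;
```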
When you compute this with ORDER BY P.[date] ROWS UNBOUNDED PRECEDING, the engine is essentially computing a rolling max over values whose most significant bytes are your dates. The max would update every time the date changes, but concatenating any value with a NULL price also yields NULL (as binary), so MAX ignores that row and retains the previous non-null maximum (within P.[date] ROWS UNBOUNDED PRECEDING).
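The NULL behaviour can be checked in isolation. The sketch below uses two made-up literal rows rather than price_table, and assumes the default session settings for concatenation with NULL:

```sql
-- Two illustrative rows built from literals: one priced, one with a NULL price.
SELECT
    V.[date],
    V.[price],
    combined = CAST(V.[date] AS BINARY(3)) + CAST(V.[price] AS BINARY(5))
FROM (VALUES
    (CAST('2016-07-11' AS DATE), CAST(0.79 AS DECIMAL(5,2))),
    (CAST('2016-07-12' AS DATE), CAST(NULL AS DECIMAL(5,2)))
) AS V([date], [price]);
-- With default settings, combined is NULL on the second row, so an aggregate
-- such as MAX skips that row and keeps the value from the first.
```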
This is the binary result of the windowed MAX (I added an earlier record with a NULL price so you can see that the result is NULL when no earlier price exists):
date price ff_price WindowedMax
2016-07-10 NULL NULL NULL
2016-07-11 0.79 0.79 0x9B3B0B050200014F
2016-07-12 NULL 0.79 0x9B3B0B050200014F
2016-07-13 NULL 0.79 0x9B3B0B050200014F
2016-07-14 0.69 0.69 0x9E3B0B0502000145
2016-07-15 NULL 0.69 0x9E3B0B0502000145
2016-07-21 0.88 0.88 0xA53B0B0502000158
2016-07-22 NULL 0.88 0xA53B0B0502000158
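
One detail visible in the WindowedMax column: the three date bytes appear lowest byte first (0x9B, 0x9E and 0xA5 lead the values above while 0x3B0B stays fixed), so the binary comparison is driven by the low byte of the date. Over a date range where that byte wraps around, the rolling max may stop following date order, so it is worth testing on a range longer than the sample. A hedged variant, not part of the original answer, builds the key from a day count instead. It assumes that an INT cast to BINARY(4) is stored high byte first and that no date is earlier than 1900-01-01:

```sql
-- Variant key: days since 1900-01-01 as a 4-byte integer, high byte first.
-- Assumption: no dates before 1900-01-01 (negative counts would compare as larger).
SELECT
    P.[date],
    P.[price],
    ff_price = CONVERT(
        DECIMAL(5,2),
        SUBSTRING(
            MAX(
                CAST(DATEDIFF(DAY, '19000101', P.[date]) AS BINARY(4)) +  -- 4-byte sortable key
                CAST(P.[price] AS BINARY(5))                              -- 5 price bytes
            ) OVER (ORDER BY P.[date] ROWS UNBOUNDED PRECEDING),
            5,   -- skip the 4 key bytes
            5))  -- read the 5 price bytes
FROM
    price_table AS P;
```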