Problem description
PROBNORM: explanation
The PROBNORM function in SAS returns the probability that an observation from the standard normal distribution is less than or equal to x.
Is there an equivalent function in PySpark?
Recommended answer
I'm afraid PySpark has no built-in method for this.
However, you can use a Pandas UDF to define your own function with ordinary Python packages. Here we use the scipy.stats.norm module to get cumulative probabilities from the standard normal distribution.
Versions I am using:
Spark 3.1.1
pandas 1.1.5
scipy 1.5.2
Sample code
import pandas as pd
from scipy.stats import norm
import pyspark.sql.functions as F
from pyspark.sql.functions import pandas_udf

# create sample data (assumes an active SparkSession named `spark`)
df = spark.createDataFrame([
    (1, 0.00),
    (2, -1.23),
    (3, 4.56),
], ['id', 'value'])

# define your custom Pandas UDF
@pandas_udf('double')
def probnorm(s: pd.Series) -> pd.Series:
    # norm.cdf is vectorized, so it evaluates the whole batch at once
    return pd.Series(norm.cdf(s))

# create a new column using the Pandas UDF
df = df.withColumn('pnorm', probnorm(F.col('value')))
df.show()
+---+-----+-------------------+
| id|value|              pnorm|
+---+-----+-------------------+
|  1|  0.0|                0.5|
|  2|-1.23|0.10934855242569191|
|  3| 4.56| 0.9999974423189606|
+---+-----+-------------------+
Edit
If scipy is not properly installed on your workers as well, you can use the Python standard-library module math and a little bit of statistics knowledge.
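The statistics knowledge in question is the standard identity that expresses the normal cumulative distribution function through the error function. It is written out here for reference; the symbols \Phi and F are notation chosen for this note, not names from the original answer.

```latex
% Standard normal CDF (what PROBNORM returns)
\Phi(x) = P(Z \le x) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right]

% General normal distribution with mean \mu and standard deviation \sigma
F(x;\mu,\sigma) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right]
```

Setting \mu = 0 and \sigma = 1 in the second line recovers the first.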
import math
from pyspark.sql.functions import udf

def normal_cdf(x, mu=0, sigma=1):
    """
    Cumulative distribution function for the normal distribution
    with mean `mu` and standard deviation `sigma`
    """
    return (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) / 2

# declare the return type; udf() defaults to string otherwise
my_udf = udf(normal_cdf, 'double')
df = df.withColumn('pnorm', my_udf(F.col('value')))
df.show()
+---+-----+-------------------+
| id|value|              pnorm|
+---+-----+-------------------+
|  1|  0.0|                0.5|
|  2|-1.23|0.10934855242569197|
|  3| 4.56| 0.9999974423189606|
+---+-----+-------------------+
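The results are effectively the same: the two outputs differ only in the last digit of the second row, which is floating-point rounding.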
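To sanity-check the erf-based formula without a Spark session, here is a minimal sketch that uses only the standard library. It assumes Python 3.8 or later, where `statistics.NormalDist` is available; that class is brought in here purely as an independent reference and is not part of the original answer.

```python
import math
from statistics import NormalDist


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Normal CDF computed from the error function."""
    return (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) / 2


# Reference implementation: standard normal (mean 0, standard deviation 1)
std_normal = NormalDist()

for x in (0.0, -1.23, 4.56):
    via_erf = normal_cdf(x)
    via_stdlib = std_normal.cdf(x)
    # The two values should agree to within floating-point rounding
    agree = math.isclose(via_erf, via_stdlib, rel_tol=1e-12)
    print(x, via_erf, via_stdlib, agree)
```

Because this runs as plain Python, it is a convenient way to test the function body before wrapping it in a UDF.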