You can implement the formula directly:
val df_ll = df.withColumn("logloss", -($"target_col" * log($"predicted_col") + (lit(1) - $"target_col") * log(lit(1) - $"predicted_col")))
Note that this uses only built-in functions from spark.sql.functions, which means it gets quite good performance (better than a UDF).
So here is a convenience function that computes the log loss for a given DataFrame (TARGET_COL is the ground truth, aka the label, and PREDICTED_COL is the prediction returned by the model):
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions._

def calculateLogLoss(df: Dataset[Row]): Double = {
  // logloss = -(y * log(p) + (1 - y) * log(1 - p))
  val df2 = df.withColumn("logloss",
    col(TARGET_COL).multiply(lit(-1))
      .multiply(log(col(PREDICTED_COL)))
      .minus(
        lit(1.0).minus(col(TARGET_COL))
          .multiply(log(lit(1).minus(col(PREDICTED_COL))))
      )
  )
  // the overall log loss is the average over all samples
  val loglossA = df2.agg(mean("logloss").alias("ll")).collect()
  var logloss = -1d
  if (loglossA != null && loglossA.length > 0) {
    val loglossB = loglossA.head
    if (loglossB != null && loglossB.length > 0 && loglossB.get(0) != null) {
      logloss = loglossB.getDouble(0)
    }
  }
  logloss
}
This simple function can be used to get the log loss:
from pyspark.sql.types import FloatType
import pyspark.sql.functions as F

def logloss(predictions, label_column):
    """
    Calculate the log loss.
    :param pyspark.sql.DataFrame predictions: DataFrame of model predictions
    :param str label_column: name of the column containing the true labels
    :return: float log loss
    """
    # extract the probability of the positive class (index 1 of the probability vector)
    get_first_element = F.udf(lambda v: float(v[1]), FloatType())
    predictions = predictions.select(F.col(label_column).alias("label"), "probability")
    predictions = predictions.withColumn("true_prediction_probability",
                                         get_first_element(F.col("probability"))).drop("probability")
    # logloss = -(y * log(p) + (1 - y) * log(1 - p))
    predictions = predictions.withColumn("logloss",
        -(F.col("label") * F.log(F.col("true_prediction_probability")) +
          (F.lit(1) - F.col("label")) * F.log(F.lit(1) - F.col("true_prediction_probability"))))
    log_loss = predictions.select(F.mean("logloss")).collect()[0][0]
    return log_loss
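To sanity-check the result without a Spark session, the same average can be reproduced in plain Python. This is a hypothetical reference implementation (not from the original answer), assuming probability vectors of the form [P(class 0), P(class 1)] as produced by Spark ML classifiers:

```python
import math

def logloss_reference(rows):
    """rows: list of (label, probability_vector) pairs, mirroring the
    DataFrame columns used by the logloss() function above."""
    losses = []
    for label, prob in rows:
        p = prob[1]  # probability of class 1, like get_first_element
        losses.append(-(label * math.log(p) + (1 - label) * math.log(1 - p)))
    return sum(losses) / len(losses)  # mean over all samples

rows = [(1, [0.2, 0.8]), (0, [0.7, 0.3]), (1, [0.4, 0.6])]
print(logloss_reference(rows))
```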