Why does precision_recall_curve() return values different from the confusion matrix?
Question
I wrote the following code to compute precision and recall for a multiclass classification problem:
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_auc_score
def find_nearest(array, value):
    array = np.asarray(array)
    idx = (np.abs(array - value)).argmin()
    return idx
# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]
# Add noisy features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]
# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
# Learn to predict each class against the other
classifier = OneVsRestClassifier(
    svm.SVC(kernel="linear", probability=True, random_state=random_state)
)
classifier.fit(X_train, y_train)
y_score = classifier.decision_function(X_test)
# Confusion matrix
from sklearn.metrics import classification_report
y_test_pred = classifier.predict(X_test)
print(classification_report(y_test, y_test_pred))
# Compute the precision-recall curve for each class
precision = dict()
recall = dict()
threshold = dict()
for i in range(n_classes):
    c = classifier.classes_[i]
    precision[c], recall[c], threshold[c] = precision_recall_curve(y_test[:, c], y_score[:, c])
    # Index of the threshold closest to 0
    th0 = find_nearest(threshold[c], 0)
    print(c, round(precision[c][th0],2), round(recall[c][th0], 2))
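What I am trying to do is recompute the precision and recall shown in the classification report:

              precision    recall  f1-score   support

           0       0.73      0.52      0.61        21
           1       1.00      0.07      0.12        30
           2       0.57      0.33      0.42        24

   micro avg       0.68      0.28      0.40        75
   macro avg       0.77      0.31      0.39        75
weighted avg       0.79      0.28      0.36        75
 samples avg       0.28      0.28      0.28        75

using the precision_recall_curve() function. In theory, when the threshold equals 0, it should return exactly the same results as the report. However, my results do not match:

   precision  recall
0       0.75    0.57
1       1.0     0.1
2       0.6     0.38

Could you explain this difference, and how to correctly compute the values in the report?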
Recommended answer
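As I wrote in the comments, using index th0 + 1 instead of index th0 solves your problem. However, this may only work by coincidence (in this particular example, the threshold closest to 0 always corresponds to a negative score). For a programmatic approach, you should therefore modify find_nearest so that it returns the index of the threshold that is positive and closest to 0. You can see this by adding the following line to your loop: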
print(th0, threshold[c][th0-1], threshold[c][th0], threshold[c][th0+1])
You will get the following output (th0, then the thresholds at th0 - 1, th0, and th0 + 1):
20 -0.011161920989200713 -0.01053513227868108 0.016453546101096173
67 -0.04226738229343663 -0.0074193008862454835 0.09194626401603534
38 -0.011860865951094923 -0.003756310149749531 0.0076752136658660985
For a more programmatic approach, you can simply modify find_nearest as follows and keep index th0 in your loop.
def find_nearest_new(array, value):
    array = np.asarray(array)
    idx = (np.abs(np.where(array > 0, array, 999) - value)).argmin()
    return idx
...
for i in range(n_classes):
    c = classifier.classes_[i]
    precision[c], recall[c], threshold[c] = precision_recall_curve(y_test[:, c], y_score[:, c])
    th0 = find_nearest_new(threshold[c], 0)
    print(c, round(precision[c][th0],6), round(recall[c][th0], 6), round(threshold[c][th0],6))
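The 999 sentinel above works only as long as every positive threshold is smaller than 999. A minimal sketch of an alternative, assuming the thresholds come back sorted in increasing order (as the precision_recall_curve documentation describes); the helper name first_positive_threshold_index is hypothetical and not part of scikit-learn:

```python
import numpy as np

def first_positive_threshold_index(thresholds):
    """Return the index of the smallest strictly positive threshold.

    Assumes `thresholds` is sorted in increasing order.
    """
    thresholds = np.asarray(thresholds, dtype=float)
    # side="right" skips any threshold equal to 0, mirroring `array > 0` above.
    idx = int(np.searchsorted(thresholds, 0.0, side="right"))
    if idx == len(thresholds):
        raise ValueError("no strictly positive threshold found")
    return idx

# The three thresholds printed for class 0 in the output above.
example = [-0.011161920989200713, -0.01053513227868108, 0.016453546101096173]
print(first_positive_threshold_index(example))
```

For these three values the helper picks the last one (index 2 within this short list), the positive neighbour that th0 + 1 points at in the full array.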
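My clue is this: in the precision_recall_curve implementation, precision and recall are defined as follows:
precision: ndarray of shape (n_thresholds + 1,). Precision values such that element i is the precision of predictions with score >= thresholds[i], and the last element is 1.
recall: ndarray of shape (n_thresholds + 1,). Decreasing recall values such that element i is the recall of predictions with score >= thresholds[i], and the last element is 0.
In other words, if you sort the scores in descending order (as the implementation does), you will see that the selected threshold (whether you get it from the modified function or from index th0 + 1) coincides with the smallest positive score of each class (the thresholds are in fact the distinct score values). On the other hand, if you stick with index th0 (in this particular example), you get a threshold strictly below 0, so at least one sample with a negative score is counted as a positive prediction.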
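To make the effect of the `score >= thresholds[i]` definition concrete, here is a small sketch on made-up data (the labels, scores, and the helper precision_recall_at are hypothetical, not taken from the question). It compares the threshold nearest to 0 in absolute value, which happens to be negative, with the smallest positive threshold:

```python
import numpy as np

def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when every sample with score >= threshold is predicted positive.

    Assumes at least one predicted positive and at least one actual positive.
    """
    y_true = np.asarray(y_true, dtype=bool)
    predicted = np.asarray(scores, dtype=float) >= threshold
    true_positives = np.sum(predicted & y_true)
    precision = true_positives / predicted.sum()
    recall = true_positives / y_true.sum()
    return float(precision), float(recall)

# Hypothetical toy data: binary labels and signed decision scores for one class.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.02, -0.01, -0.3, -0.8]

# -0.01 is the score closest to 0 in absolute value, but it is negative.
nearest = precision_recall_at(y_true, scores, -0.01)
# 0.02 is the smallest positive score.
positive = precision_recall_at(y_true, scores, 0.02)
print(nearest, positive)
```

With the negative threshold, the sample scored -0.01 is counted as a positive prediction, so the two operating points differ (0.75 and 1.0 versus 2/3 and 2/3), which is the same kind of gap seen between the curve and the report in the question.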
for i in range(n_classes):
    c = classifier.classes_[i]
    precision[c], recall[c], threshold[c] = precision_recall_curve(y_test[:, c], y_score[:, c])
    th0 = find_nearest(threshold[c], 0)
    print(c, round(precision[c][th0+1],6), round(recall[c][th0+1], 6), round(threshold[c][th0+1],6))
    #print(c, precision[c], recall[c], threshold[c])
    print(np.sort(y_score[:,c])[::-1])
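As a sanity check, the point read off the curve can be compared with precision_score and recall_score computed on hard predictions. This is a sketch on made-up data, and it assumes that the classifier labels a sample positive when its decision score is strictly greater than 0:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

# Hypothetical toy data: binary labels and signed decision scores for one class.
y_true = np.array([1, 0, 1, 1, 0, 0])
scores = np.array([0.9, 0.4, 0.02, -0.01, -0.3, -0.8])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# thresholds is increasing, so this is the index of the first strictly positive one.
idx = int(np.searchsorted(thresholds, 0.0, side="right"))

# Hard predictions under the assumed decision rule (score > 0 means positive).
y_pred = (scores > 0).astype(int)

curve_point = (precision[idx], recall[idx])
report_point = (precision_score(y_true, y_pred), recall_score(y_true, y_pred))
print(curve_point, report_point)
```

If the two tuples agree, the index chosen on the curve corresponds to the same operating point that the classification report describes.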
This post may help in understanding how precision_recall_curve() works.