SVM and Random Forest with recall = 0

I am trying to predict one of two values that can appear in the "exit" column. I have clean data (about 20 columns and 4k rows with typical customer information such as "gender", "age", etc.). In the training dataset, about 20% of the customers are labelled "1". I built two models, an SVM and a random forest, but both predict mostly "0" on the test dataset (almost every time). Recall is 0 for both models. I think I may have made some silly mistake in the code. Any ideas why recall is so low while accuracy is around 80%?

import pandas as pd
import sklearn
from scipy import stats
from scipy.stats import randint as sp_randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, make_scorer, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def ml_model():
    print('sklearn: %s' % sklearn.__version__)
    df = pd.read_csv('clean_data.csv')
    df.head()
    feat = df.drop(columns=['target'],axis=1)
    label = df["target"]
    x_train,x_test,y_train,y_test = train_test_split(feat,label,test_size=0.3)
    sc_x = StandardScaler()
    x_train = sc_x.fit_transform(x_train)

    # SVC method
    support_vector_classifier = SVC(probability=True)
    # Grid search
    rand_list = {"C": stats.uniform(0.1,10),"gamma": stats.uniform(0.1,1)}
    auc = make_scorer(roc_auc_score)
    rand_search_svc = RandomizedSearchCV(support_vector_classifier,param_distributions=rand_list,n_iter=100,n_jobs=4,cv=3,random_state=42,scoring=auc)
    rand_search_svc.fit(x_train,y_train)
    support_vector_classifier = rand_search_svc.best_estimator_
    cross_val_svc = cross_val_score(estimator=support_vector_classifier,X=x_train,y=y_train,cv=10,n_jobs=-1)
    print("Cross Validation accuracy for SVM: ",round(cross_val_svc.mean() * 100,2),"%")
    predicted_y = support_vector_classifier.predict(x_test)
    tn,fp,fn,tp = confusion_matrix(y_test,predicted_y).ravel()
    precision_score = tp / (tp + fp)
    recall_score = tp / (tp + fn)
    print("Recall score SVC: ",recall_score)


    # Random forests
    random_forest_classifier = RandomForestClassifier()
    # Grid search
    param_dist = {"max_depth": [3,None],"max_features": sp_randint(1,11),"min_samples_split": sp_randint(2,11),"bootstrap": [True,False],"criterion": ["gini","entropy"]}
    rand_search_rf = RandomizedSearchCV(random_forest_classifier,param_distributions=param_dist,cv=5,iid=False)
    rand_search_rf.fit(x_train,y_train)
    random_forest_classifier = rand_search_rf.best_estimator_
    cross_val_rfc = cross_val_score(estimator=random_forest_classifier,X=x_train,y=y_train,cv=10,n_jobs=-1)
    print("Cross Validation accuracy for RF: ",round(cross_val_rfc.mean() * 100,2),"%")
    predicted_y = random_forest_classifier.predict(x_test)
    tn,fp,fn,tp = confusion_matrix(y_test,predicted_y).ravel()
    precision_score = tp / (tp + fp)
    recall_score = tp / (tp + fn)
    print("Recall score RF: ",recall_score)

    new_data = pd.read_csv('new_data.csv')
    new_data = cleaning_data_to_predict(new_data)
    if round(cross_val_svc.mean() * 100,2) > round(cross_val_rfc.mean() * 100,2):
        predictions = support_vector_classifier.predict(new_data)
        predictions_proba = support_vector_classifier.predict_proba(new_data)
    else:
        predictions = random_forest_classifier.predict(new_data)
        predictions_proba = random_forest_classifier.predict_proba(new_data)

    with open("output.txt","w") as f:
        for i in range(len(predictions)):
            print("id: ",i,"probability: ",predictions_proba[i][1],"exit: ",predictions[i],file=f)
zchyqwerty answered: SVM and Random Forest with recall = 0

Unless I missed it, you forgot to scale the test set. So you need to scale it as well. Note that you should only transform it, not fit the scaler again. See below.

x_test = sc_x.transform(x_test)
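A minimal sketch of the full scaling flow, using made-up toy data in place of the real feature matrix. Wrapping the scaler and model in a Pipeline (an alternative to calling the scaler manually) also keeps the scaler correct inside cross-validation, since it is re-fit on each training fold instead of leaking statistics from held-out data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data standing in for the real feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * 100  # deliberately unscaled features
y = (X[:, 0] > 0).astype(int)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the scaler on the training set only, then reuse it on the test set.
sc_x = StandardScaler()
x_train_s = sc_x.fit_transform(x_train)
x_test_s = sc_x.transform(x_test)   # transform only -- no second fit

# A Pipeline bundles scaler + model, so cross_val_score re-fits the
# scaler inside each fold rather than leaking test-fold statistics.
pipe = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(pipe, x_train, y_train, cv=5)
print(round(scores.mean(), 3))
```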

I agree with @e_kapti. Also, check the formulas for recall and accuracy; you might consider using the F1 score instead (https://en.wikipedia.org/wiki/F1_score).

Recall = TP / (TP + FN), Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP, FP, TN and FN are the numbers of true positives, false positives, true negatives and false negatives, respectively.
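These formulas can be checked directly against scikit-learn's metric functions. A small sketch with made-up labels shows how, on an imbalanced set like the one in the question, a model that always predicts "0" still scores 80% accuracy while recall on the minority class is zero:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score

# Hypothetical labels: 20% positives, model always predicts the majority class.
y_true = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
y_pred = [0] * 10

# confusion_matrix returns [[tn, fp], [fn, tp]] for binary labels.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("accuracy:", accuracy_score(y_true, y_pred))               # (tp+tn)/total = 0.8
print("recall:", recall_score(y_true, y_pred, zero_division=0))  # tp/(tp+fn) = 0.0
print("f1:", f1_score(y_true, y_pred, zero_division=0))          # 0.0
```

The F1 score combines precision and recall, so it also collapses to 0 here and exposes the problem that plain accuracy hides.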

Article link: https://www.f2er.com/3114233.html
