sklearn train_test_split returns some elements in both test and train

I have a dataset X containing 260 unique observations.

When running x_train,x_test,_,_=train_test_split(X,y,test_size=0.2), I assumed that [p for p in x_test if p in x_train] would be empty, but it is not. In fact, it turns out that only two observations in x_test are not in x_train.

Is that intentional, or...?

Edit (posting the data I am using):

x_train

EDIT 2.0: showing that the test works

from sklearn.datasets import load_breast_cancer 
from sklearn.model_selection import train_test_split as split
import numpy as np

DATA=load_breast_cancer()
X=DATA.data
y= DATA.target
y=np.array([1 if p==0 else 0 for p in DATA.target])

x_train,x_test,y_train,y_test=split(X,y,test_size=0.2,stratify=y,random_state=42)

len([p for p in x_test if p in x_train]) #is not 0

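d4133456929295 answered: sklearn train_test_split returns some elements in both test and train

This is not a bug in sklearn's implementation of train_test_split, but a quirk of how the in operator works on numpy arrays. The in operator first performs an element-wise comparison between the two arrays and returns True if any single element matches.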

import numpy as np

a = np.array([[1,2,3],[4,5,6]])
b = np.array([[6,7,8],[5,5,5]])
a in b # True
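To make the quirk concrete, here is a minimal sketch (the array values are made up for illustration) showing that, for a single row, `row in arr` behaves like `(arr == row).any()`: it is true as soon as one element matches by position, even though no complete row matches.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
row = np.array([9, 5, 9])  # not a row of arr, but shares the 5 in column 1

# `in` on a numpy array: True if ANY element matches position-wise
loose = row in arr

# What `in` effectively computes
equivalent = bool((arr == row).any())

# Whole-row match: every element of some row must be equal
strict = bool((arr == row).all(axis=1).any())

print(loose, equivalent, strict)
```

Only the `strict` check answers the question "does this exact row occur in the array?".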

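The correct way to test for this kind of overlap is to use the equality operator together with np.all and np.any. As a bonus, you also get the indices of the overlapping rows for free.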

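import numpy as np

a = np.array([[1,2,3],[4,5,6]])
b = np.array([[6,7,8],[5,5,5]])
a in b # True

z = np.any(np.all(a == b[:,None,:],-1))  # False

a = np.array([[1,2,3],[4,5,6]])
b = np.array([[1,7,8],[1,2,3]])
a in b # True

overlap = np.all(a == b[:,None,:],-1)
z = np.any(overlap)  # True
indices = np.nonzero(overlap)  # (array([1]), array([0])), i.e. b[1] equals a[0]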
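The broadcasting comparison above can be wrapped in a small helper. This is only a sketch: `find_row_overlap` is a hypothetical name, not a numpy or sklearn function, and the sample arrays are made up.

```python
import numpy as np

def find_row_overlap(a, b):
    """Return (b_idx, a_idx): index arrays of rows where b[i] equals a[j] exactly."""
    a = np.asarray(a)
    b = np.asarray(b)
    # b[:, None, :] has shape (len(b), 1, n_cols); comparing with a broadcasts
    # to (len(b), len(a), n_cols). Reducing the last axis with np.all leaves a
    # (len(b), len(a)) boolean matrix of exact row matches.
    overlap = np.all(a == b[:, None, :], axis=-1)
    return np.nonzero(overlap)

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[6, 7, 8], [1, 2, 3], [4, 5, 6]])

b_idx, a_idx = find_row_overlap(a, b)
print(b_idx, a_idx)  # b[1] matches a[0], b[2] matches a[1]
```

Note that the intermediate boolean array has len(b) * len(a) * n_cols entries, so this approach is best suited to arrays of moderate size.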

You need to run the check like this instead:

from sklearn.datasets import load_breast_cancer 
from sklearn.model_selection import train_test_split as split
import numpy as np

DATA=load_breast_cancer()
X=DATA.data
y= DATA.target
y=np.array([1 if p==0 else 0 for p in DATA.target])

x_train,x_test,y_train,y_test=split(X,y,test_size=0.2,stratify=y,random_state=42)

len([p for p in x_test.tolist() if p in x_train.tolist()])
0

With x_test.tolist(), the in operator works as expected, because membership on plain Python lists compares whole rows.

Reference: testing whether a Numpy array contains a given row
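For larger arrays, scanning a list for every test row becomes slow. A possible variant, sketched here with a hypothetical helper `shared_rows` and toy data, converts the training rows to a set of tuples so each lookup is a hash lookup. It also contrasts the result with the naive numpy `in` check.

```python
import numpy as np

def shared_rows(x_test, x_train):
    """Return the rows of x_test that occur as complete rows in x_train."""
    # Tuples are hashable, so the training rows can live in a set
    train_rows = {tuple(r) for r in np.asarray(x_train).tolist()}
    return [r for r in np.asarray(x_test).tolist() if tuple(r) in train_rows]

x_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
x_test = np.array([[3.0, 9.0], [7.0, 8.0]])

# Naive check: [3.0, 9.0] is flagged because its first element matches a
# training row position-wise, although the full row is not in x_train
naive = [p for p in x_test if p in x_train]

# Exact check: no test row occurs in x_train
exact = shared_rows(x_test, x_train)

print(len(naive), len(exact))
```

This relies on exact equality, so floating-point rows that differ only by rounding are treated as different.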
