I have a dataset X that contains 260 unique observations.
When running
x_train,x_test,_,_=train_test_split(X,y,test_size=0.2)
I would expect
[p for p in x_test if p in x_train]
to be empty, but it is not. In fact, it turns out that only two observations in x_test are not in x_train.
Is that intended, or...?
EDIT (posting the data I am using):
x_train
EDIT 2.0: showing that the test works
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split as split
import numpy as np

DATA = load_breast_cancer()
X = DATA.data
# Flip the labels: the original class 0 becomes 1 and class 1 becomes 0
y = np.array([1 if p == 0 else 0 for p in DATA.target])
# Pass both X and y so that all four splits are returned
x_train, x_test, y_train, y_test = split(X, y, test_size=0.2, stratify=y, random_state=42)
len([p for p in x_test if p in x_train])  # is not 0
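
One possible explanation, offered as a sketch rather than a confirmed diagnosis: for NumPy arrays, p in x_train is evaluated as (x_train == p).any(), so it is True as soon as any single feature value of p matches the value in the same column of some training row. It does not require the whole row to match. The helper count_shared_rows and the toy arrays below are hypothetical names and data made up for illustration; they compare whole rows instead.

```python
import numpy as np

def count_shared_rows(a, b):
    """Count rows of `a` that also appear, in full, as a row of `b`."""
    # Compare every row of a with every row of b: shape (len(a), len(b), n_features)
    equal = a[:, None, :] == b[None, :, :]
    # A row counts as shared only if all of its features match some row of b
    return int(equal.all(axis=2).any(axis=1).sum())

# Toy data (made up): no row of `test` is identical to a row of `train`
train = np.array([[1.0, 2.0], [3.0, 4.0]])
test = np.array([[1.0, 9.0], [5.0, 6.0]])

# Element-wise membership: [1.0, 9.0] "matches" because 1.0 sits in column 0 of train
naive_hits = len([p for p in test if p in train])
# Whole-row membership: no test row equals a training row
exact_hits = count_shared_rows(test, train)
print(naive_hits, exact_hits)
```

Calling count_shared_rows(x_test, x_train) on the split above would then count only rows that are identical in every feature, which is the overlap the original check was meant to measure.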