The correct way to train/test split time series data without mixing subjects

2024-05-19 • Q&A

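I collected data from 36 subjects in each of two classes, so there are 72 subjects in total. Each subject's time series has a length of, for example, 30000. I split the 30000-sample array into 10 segments of 3000 samples and compute features on each segment. Suppose the number of features is 5. The output for one subject is then 10 x 5, and for all 72 subjects it is 72 x 10 x 5.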

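I want to split the data into training and test sets so that if a subject is in the training set, none of its segments appear in the test set. All segments of a subject should be either in training or in test; a single subject should never be divided between the two.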

What I do:

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42,stratify=y)

It returns data with the following shapes:

X_train.shape,X_test.shape,y_train.shape,y_test.shape
((57,10,5),(15,10,5),(57,),(15,))

I reshaped the features so that they can be fed to a machine learning algorithm:

X_train = X_train.reshape(-1,5)
X_test = X_test.reshape(-1,5)

Now X_train has shape (570,5) and X_test has shape (150,5), but the corresponding y_train has shape (57,) and y_test has shape (15,).

The question is: how do I make the labels match these shapes?
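For the fixed case of 10 segments per subject, one way to line the labels up is to repeat each subject's label once per segment with np.repeat. This works because reshape(-1, 5) keeps the segments of one subject in consecutive rows. A minimal sketch, using random stand-in arrays with the shapes from the question (the values and the names X_train_flat and y_train_flat are assumptions for illustration):

```python
import numpy as np

# Stand-in for the subject-level training split: 57 subjects, 10 segments, 5 features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(57, 10, 5))
y_train = rng.integers(0, 2, size=57)   # one label per subject

n_segments = X_train.shape[1]

# One row per segment; subject i occupies rows i*10 .. i*10+9.
X_train_flat = X_train.reshape(-1, X_train.shape[-1])

# Repeat each subject's label once per segment, in the same row order.
y_train_flat = np.repeat(y_train, n_segments)

print(X_train_flat.shape, y_train_flat.shape)
```

The same two lines apply unchanged to X_test and y_test.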

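In this particular case the number of segments is 10, but it can differ from subject to subject. Some subjects may have 9 segments, some 11, and so on, because the length of 30,000 is not fixed.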
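When the segment count varies, a 3-D array no longer fits. One option is to store one row per segment and keep a parallel array of subject ids, then let scikit-learn's GroupShuffleSplit do the split by subject. The sketch below uses made-up data; the segment counts, the random features, and the names feature_blocks and groups are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_subjects, n_features = 72, 5
subject_labels = np.array([0] * 36 + [1] * 36)   # one label per subject

# Hypothetical data: each subject has 9, 10, or 11 segments.
segments_per_subject = rng.integers(9, 12, size=n_subjects)
feature_blocks = [rng.normal(size=(n, n_features)) for n in segments_per_subject]

# One row per segment, plus the label and subject id of every row.
X_flat = np.vstack(feature_blocks)
y_flat = np.repeat(subject_labels, segments_per_subject)
groups = np.repeat(np.arange(n_subjects), segments_per_subject)

# Split by subject: every row of a given subject lands on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X_flat, y_flat, groups=groups))

X_train, X_test = X_flat[train_idx], X_flat[test_idx]
y_train, y_test = y_flat[train_idx], y_flat[test_idx]

shared = set(groups[train_idx]) & set(groups[test_idx])
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
print("subjects in both sets:", len(shared))
```

Note that GroupShuffleSplit does not stratify by class, so the class balance of the test set can drift from that of the full data.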

One more thing: how do I do cross-validation?

Edit: I tried the following approach. Can you confirm whether this is the right way to do it?

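import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X has shape (n_subjects, n_segments, n_features); y has one label per subject.
n_segments, n_features = X.shape[1], X.shape[2]

# Folds are built at the subject level, so all segments of a subject stay together.
skf = StratifiedKFold(n_splits=5)

acc = []
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Flatten to one row per segment.
    X_train = X_train.reshape(-1, n_features)
    X_test = X_test.reshape(-1, n_features)

    # Repeat each subject's label once per segment so labels match the rows.
    y_train = np.repeat(y_train, n_segments)
    y_test = np.repeat(y_test, n_segments)

    # Fit the scaler on the training fold only, then apply it to the test fold.
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    clf = SVC(kernel="rbf")
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(accuracy)
    acc.append(accuracy)

print('-------------------------')
print('mean', np.array(acc).mean())
print('std', np.array(acc).std())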


zhu033033 answered: The correct way to train/test split time series data without mixing subjects
