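I'm completely new to machine learning. I'm working through some code that classifies emails as spam or ham, and I ran into a problem when adapting it to a different dataset. My dataset doesn't have a single spam/ham label. Instead it has 2 different categorical values (age and gender). When I try to pass both of them in the code block below, I get an error saying there are too many values to unpack. How can I use both label columns?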
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(messages_bow,import_data['age'],import_data['gender'],test_size = 0.20,random_state = 0)
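For context on the error: `train_test_split` returns two pieces (train and test) for every array passed to it, so three input arrays produce six return values, and unpacking them into four names fails. The sketch below illustrates this on small made-up arrays; `X`, `age`, and `gender` are hypothetical stand-ins for `messages_bow`, `import_data['age']`, and `import_data['gender']`, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real feature matrix and label columns.
X = np.arange(20).reshape(10, 2)
age = np.array(['young', 'old', 'young', 'old', 'young',
                'old', 'young', 'old', 'young', 'old'])
gender = np.array(['f', 'm', 'm', 'f', 'f', 'm', 'f', 'm', 'm', 'f'])

# Three input arrays -> six outputs (a train/test pair for each input).
parts = train_test_split(X, age, gender, test_size=0.20, random_state=0)
print(len(parts))

# Unpacking into six names works; unpacking into four raises
# "ValueError: too many values to unpack".
X_train, X_test, age_train, age_test, gender_train, gender_test = parts
print(X_train.shape, X_test.shape)
```

All six pieces are split with the same row shuffling, so `age_train` and `gender_train` stay aligned with the rows of `X_train`.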
Full code:
import numpy as np
import pandas
import nltk
from nltk.corpus import stopwords
import string
# Import Data.
import_data = pandas.read_csv('/root/Desktop/%20/%100.csv',encoding='cp1252')
# To See Columns Headers.
print(import_data.columns)
# To Remove Duplications.
import_data.drop_duplicates(inplace = True)
# To Find Data Size.
print(import_data.shape)
#Tokenization (a list of tokens),will be used as the analyzer
#1.Punctuations are [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]
#2.Stop words in natural language processing,are useless words (data).
def process_text(text):
    '''
    What will be covered:
    1. Remove punctuation
    2. Remove stopwords
    3. Return list of clean text words
    '''
    #1
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    #2
    clean_words = [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
    #3
    return clean_words
#Show the Tokenization (a list of tokens )
print(import_data['text'].head().apply(process_text))
# Convert the text into a matrix of token counts.
from sklearn.feature_extraction.text import CountVectorizer
messages_bow = CountVectorizer(analyzer=process_text).fit_transform(import_data['text'])
#Split data into 80% training & 20% testing data sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(messages_bow,import_data['frequency'],test_size = 0.20,random_state = 0)
#Get the shape of messages_bow
print(messages_bow.shape)
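One possible way to keep a four-value split with two label columns is to pass both columns together as a single two-column target. The sketch below is only an illustration under assumptions: the tiny DataFrame `df` is made up, the labels are encoded as integer codes for simplicity, and the choice of `MultiOutputClassifier` wrapping `MultinomialNB` is an assumption, since the model used in the original tutorial is not shown here.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data; the real import_data would be loaded from the CSV instead.
df = pd.DataFrame({
    'text': ['free prize call now', 'meeting at noon tomorrow',
             'win cash today', 'lunch with the team',
             'cheap loans available', 'project deadline moved',
             'claim your reward', 'see you at the gym',
             'limited offer click here', 'notes from the lecture'],
    'age': [0, 1, 2, 1, 0, 2, 1, 0, 2, 1],      # assumed age-group codes
    'gender': [0, 1, 1, 0, 0, 1, 0, 1, 1, 0],   # assumed gender codes
})

X = CountVectorizer().fit_transform(df['text'])
y = df[['age', 'gender']]  # one target with two columns

# Two inputs (X and y) -> four outputs, so the usual unpacking works.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

# Assumption: fit one Naive Bayes classifier per label column.
clf = MultiOutputClassifier(MultinomialNB()).fit(X_train, y_train)
pred = clf.predict(X_test)  # one column of predictions per label
print(y_train.shape, pred.shape)
```

An alternative under the same assumptions would be to train two separate classifiers, one on `y_train['age']` and one on `y_train['gender']`, which makes it easier to tune or evaluate each label independently.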