word2vec: how to classify text with an SVM?

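I have a CSV file with two columns: class and text_data. I first extract bigrams and trigrams, then try to classify my data with an SVM. However, it fails with "TypeError: sequence item 0: expected a bytes-like object, str found". I am using Gensim 4.0.0. Any help is much appreciated. Code: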

# all packages imported
df_covid = pd.read_csv('allCurated853_4.csv',encoding="utf8")
df_covid['label'] = df_covid['class'].map({
    'covidshutdown': 0,'manufactoring':1,'corporate':2,'environmental':3,'infrastructure':4,'other':5
})
X = df_covid['text_data']
y = df_covid['label']
corpus=X
lst_corpus = []
for string in corpus:
    lst_words = string.split()
    lst_grams = [" ".join(lst_words[i:i+1])
                 for i in range(0,len(lst_words),1)]
    lst_corpus.append(lst_grams)

bigrams_detector = gensim.models.phrases.Phrases(lst_corpus,delimiter=" ".encode(),min_count=5,threshold=10)
bigrams_detector = gensim.models.phrases.Phraser(bigrams_detector)
trigrams_detector = gensim.models.phrases.Phrases(bigrams_detector[lst_corpus],threshold=10)
trigrams_detector = gensim.models.phrases.Phraser(trigrams_detector)

cv = gensim.models.word2vec.Word2Vec(lst_corpus,size=300,window=8,min_count=1,sg=1,iter=30)

X_trainCv,X_testCv,y_trainCv,y_testCv = train_test_split(cv,y,test_size=0.20,random_state=42)
clf = svm.SVC(kernel='linear').fit(X_trainCv,y_trainCv)
y_pred = clf.predict(X_testCv)
print(classification_report(X_testCv,y_pred))
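
The reported TypeError is the message Python raises when a bytes separator is used to join str tokens, which is what `delimiter=" ".encode()` appears to trigger in Gensim 4, where `Phrases` is understood to expect a str delimiter. The following is a minimal sketch in plain Python, with a made-up token list, that reproduces the message and shows the str form working. It does not call Gensim itself, so treat the link to `Phrases` as an assumption to verify against the installed version.

```python
# Hypothetical tokens standing in for one tokenized document.
tokens = ["supply", "chain"]

bytes_delimiter = " ".encode()  # b" ", the value passed as delimiter in the question
str_delimiter = " "             # a plain str delimiter

# Joining str tokens with a bytes separator raises the TypeError from the question.
try:
    bytes_delimiter.join(tokens)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Joining with a str separator works as expected.
joined = str_delimiter.join(tokens)

print(error_message)
print(joined)
```

If this is indeed the cause, passing `delimiter=" "` to `Phrases` would be the corresponding change. Gensim 4 also renamed the `Word2Vec` arguments `size` and `iter` to `vector_size` and `epochs`.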

Full error message:


Here is the dataset: click here to download

If I do not use bigrams and trigrams and change the Word2Vec model to this:

cv= gensim.models.Word2Vec(lst_corpus,vector_size=100,window=5,workers=4)

a new error message appears:

Traceback (most recent call last):

  File "D:\Dropbox\AAA\50-2\word2Vec_trails\word2Vec_triGrams_34445.py", line 65, in <module>
    X_trainCv,X_testCv,y_trainCv,y_testCv = train_test_split(cv,y,test_size=0.20,random_state=42)

  File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 2172, in train_test_split
    arrays = indexable(*arrays)

  File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 299, in indexable
    check_consistent_length(*result)

  File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 259, in check_consistent_length
    lengths = [_num_samples(X) for X in arrays if X is not None]

  File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\utils\validation.py", in <listcomp>
    lengths = [_num_samples(X) for X in arrays if X is not None]

  File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 202, in _num_samples
    raise TypeError("Singleton array %r cannot be considered"

TypeError: Singleton array array(<gensim.models.word2vec.Word2Vec object at 0x000001A41DF59820>,dtype=object) cannot be considered a valid collection.
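
The traceback indicates that `train_test_split` received the `Word2Vec` model object itself, which is a single object and not a collection with one row per document. One common workaround, sketched here and not taken from the original post, is to average the word vectors of each document into a single fixed-length row. The `word_vectors` dict is a hypothetical stand-in for a trained model's `model.wv` lookup, and `document_vector` is an illustrative helper name.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; in practice these would come
# from a trained model, e.g. model.wv[token] in Gensim.
word_vectors = {
    "factory": np.array([1.0, 0.0, 2.0]),
    "closed": np.array([3.0, 2.0, 0.0]),
    "lockdown": np.array([0.0, 4.0, 4.0]),
}
vector_size = 3


def document_vector(tokens, vectors, size):
    """Average the vectors of the known tokens; return zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)


# Three toy tokenized documents, including out-of-vocabulary tokens.
docs = [
    ["factory", "closed"],
    ["lockdown", "unknownword"],
    ["nothing", "matches"],
]

# One row per document, so the matrix lines up with one label per document.
X = np.vstack([document_vector(doc, word_vectors, vector_size) for doc in docs])
print(X.shape)
```

A matrix built this way has the same number of rows as the label column, which is the property `train_test_split` checks for. Averaging discards word order, so it is a simple baseline and not the only option.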

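I tried (in the code below) to use the word2vec vectors the way I would use a count vectorizer or TF-IDF. It produces output, but not the correct one. I think I should build a list of vectors. Help is appreciated. Here is the code: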

import numpy as np
import pandas as pd
from gensim.models import Word2Vec
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df_covid = pd.read_csv('allCurated853_4.csv',encoding="utf8")
df_covid['label'] = df_covid['class'].map({'covidshutdowncsv': 0,'manufactoringcsv':1,'corporatecsv':2,'environmentalcsv':3,'infrastructurecsv':4,'other':5})
X = df_covid['message']
y = df_covid['label']
X = X.to_string()
ls = []
rows = X.splitlines(True)
print('size of rows:',len(rows))   # size of rows: 852
for i in rows:
    ls.append(i.split(' '))
print('total words:',len(ls))    # total words: 852
# Note: `size` and `wv.vocab` are the pre-4.0 Gensim names.
model = Word2Vec(ls,size = 4)
words = list(model.wv.vocab)
print('words in vocabulary:',len(words)) # words in vocabulary: 3110
print(words)
words = words[0:852]    # problem: keeps the first 852 vocabulary words, not one vector per document
vectors = []
for word in words:
    vectors.append(model[word].tolist())
data = np.array(vectors)
print('vectors of words:',len(data))   # vectors of words: 852
X_trainCv,X_testCv,y_trainCv,y_testCv = train_test_split(data,y,random_state=42)
clf_covid = svm.SVC(kernel='linear').fit(X_trainCv,y_trainCv)
clf_covid.score(X_testCv,y_testCv)
# score: 0.50299
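
The snippet above keeps the first 852 vocabulary words, so its rows are words and not documents, and the labels in `y` do not line up with them. Assuming each document has already been reduced to one fixed-length vector (for example by averaging its word vectors), the split, fit, and report steps can be sketched as follows. The data is randomly generated purely for illustration; it is not the poster's dataset, and the sizes and variable names are arbitrary.

```python
import numpy as np
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for document vectors: one row per document.
rng = np.random.default_rng(42)
n_docs, vector_size, n_classes = 120, 8, 3
y = np.arange(n_docs) % n_classes
centers = rng.normal(scale=3.0, size=(n_classes, vector_size))
X = centers[y] + rng.normal(size=(n_docs, vector_size))

# The feature matrix and the labels must have the same number of rows.
assert X.shape[0] == len(y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

clf = svm.SVC(kernel="linear").fit(X_train, y_train)
y_pred = clf.predict(X_test)

# classification_report takes the true labels first, then the predictions.
report = classification_report(y_test, y_pred)
print(report)
```

Note that `classification_report` compares true labels with predictions, so passing the test features as its first argument would not give a meaningful report.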
bill58702738 answered: word2vec: how to classify text with an SVM?

No good solution has been posted yet.
Article link: https://www.f2er.com/50077.html
