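I have a dataset of comments from Twitter (say, 10 instances). I want to use scikit-learn in Python to group identical words and count how often each one occurs, producing output like the following: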
**Dataset:**
comment_text
r u cmng or u not cmng
I am fine, r u fine
my frnd is gr8, wll dn.
we r nt going tday
I have a fever.
It should display the following output:
Words Count
u 3
r 3
i 2
cmng 2
fine, 1
wll 1
have 1
fever. 1
not 1
tday 1
my 1
we 1
a 1
or 1
nt 1
going 1
fine 1
dn. 1
gr8, 1
frnd 1
am 1
is 1
dtype: int64
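To make the expected table precise: the counts appear to come from lowercasing each comment and splitting it on whitespace only, so punctuation stays attached to a token (`fine,` and `fine` are counted separately). The sketch below reproduces that rule with the standard library. It assumes a space follows each comma in the comments, which the counts for `r` and `wll` imply; the names `comments` and `count_words` are illustrative and not part of the original code.

```python
from collections import Counter

# Sample comments from the question (assumption: a space follows each comma).
comments = [
    "r u cmng or u not cmng",
    "I am fine, r u fine",
    "my frnd is gr8, wll dn.",
    "we r nt going tday",
    "I have a fever.",
]


def count_words(texts):
    """Lowercase each text, split on whitespace, and tally the tokens."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


word_counts = count_words(comments)

# Print tokens from most to least frequent.
for word, count in word_counts.most_common():
    print(word, count)
```

Under this rule, `u` and `r` each occur 3 times and `i` and `cmng` twice, matching the table.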
I used this code, but it shows the wrong output:
from sklearn.feature_extraction.text import TfidfVectorizer

text = train_dataset_male['comment_text']
print(text)
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())