I have run into a problem: when I convert my keyword list into a dictionary, I cannot multiply the token counts by each keyword's frequency from the original dataset.
rowid  Keyword            Frequency
1      dermatologist      1151
2      psychiatrist       1068
3      obgyn              1017
4      internal medicine  883
5      mental health      865
6      optometry          763
7      pediatrician       678
8      pediatrics         622
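For reference, the sample shown above can be reproduced as a small DataFrame (a toy reconstruction; the real `data` is loaded elsewhere):

```python
import pandas as pd

# Toy reconstruction of the dataset above (Keyword / Frequency)
data = pd.DataFrame({
    'Keyword': ['dermatologist', 'psychiatrist', 'obgyn', 'internal medicine',
                'mental health', 'optometry', 'pediatrician', 'pediatrics'],
    'Frequency': [1151, 1068, 1017, 883, 865, 763, 678, 622],
})
print(data.head())
```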
I am trying to cluster some search keywords using LDA and TfidfModel. In my dataset I have a list of keywords and their frequencies, and I want the topic clustering over these keywords to take the frequency column into account.
import gensim
from nltk.stem import WordNetLemmatizer, SnowballStemmer

stemmer = SnowballStemmer('english')

data_text = data[['Keyword']]
data_text['index'] = data_text.index
documents = data_text
# Pre-processing steps: lemmatize and stem
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# def lemmatize_stemming(text):
#     return stemmer.stem(text)

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
doc_sample = documents[documents['index'] == 4310].values[0][0]
print('original document:')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))
documents['Keyword'] = documents['Keyword'].astype(str)
processed_docs = documents['Keyword'].map(preprocess)
processed_docs[:]
dictionary = gensim.corpora.Dictionary(processed_docs)
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break
# Create the BoW corpus, recording which words appear and how many times
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus

# This is where I am stuck: I want each freq here multiplied by the keyword's
# original Frequency from the dataset, but I don't know how to express that
[[(dictionary[id], freq) for id, freq in cp] for cp in bow_corpus[:1]]
print(bow_corpus)
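The multiplication I am after can be sketched on a toy corpus without gensim (a minimal sketch; `token2id` and `doc2bow` below are hypothetical stand-ins for gensim's `Dictionary` and `doc2bow`, and it assumes `processed_docs` and the `Frequency` column share the same row order):

```python
from collections import Counter

# Toy data standing in for processed_docs and the original Frequency column
processed_docs = [['dermatologist'], ['pediatrician'], ['pediatrician']]
frequencies = [1151, 678, 622]

# Build a word -> id mapping (stand-in for gensim.corpora.Dictionary)
token2id = {}
for doc in processed_docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def doc2bow(doc):
    # (token_id, in-document count) pairs, like gensim's doc2bow
    counts = Counter(token2id[tok] for tok in doc)
    return sorted(counts.items())

# Scale each in-document count by the keyword's dataset frequency
weighted_corpus = [
    [(tid, cnt * freq) for tid, cnt in doc2bow(doc)]
    for doc, freq in zip(processed_docs, frequencies)
]
print(weighted_corpus)  # → [[(0, 1151)], [(1, 678)], [(1, 622)]]
```

With real gensim objects the same zip-and-multiply pattern would apply to `dictionary.doc2bow(doc)` and `documents['Frequency']`, but I am not sure this is the right way to feed weights into LDA.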
I am not sure what the best way is to prepare these keywords so that they cluster well and yield useful topics. Please advise.