I have run into a problem: when I convert my keyword list into a dictionary, I cannot multiply the token counts by each keyword's frequency from the original dataset.
rowid  Keyword            Frequency
1      dermatologist      1151
2      psychiatrist       1068
3      obgyn              1017
4      internal medicine  883
5      mental health      865
6      optometry          763
7      pediatrician       678
8      pediatrics         622
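For reference, the sample shown above can be reproduced as a small DataFrame (a toy reconstruction; the real `data` is loaded elsewhere):

```python
import pandas as pd

# Toy reconstruction of the dataset above (Keyword / Frequency)
data = pd.DataFrame({
    'Keyword': ['dermatologist', 'psychiatrist', 'obgyn', 'internal medicine',
                'mental health', 'optometry', 'pediatrician', 'pediatrics'],
    'Frequency': [1151, 1068, 1017, 883, 865, 763, 678, 622],
})
print(data.head())
```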
I am trying to cluster some search keywords using LDA and TfidfModel. In my dataset I have a list of keywords and their frequencies, and I want the topic clustering over these keywords to take the frequency column into account.
import gensim
from nltk.stem import WordNetLemmatizer, SnowballStemmer

stemmer = SnowballStemmer('english')

data_text = data[['Keyword']]
data_text['index'] = data_text.index
documents = data_text
# Pre-processing steps: lemmatize and stem
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# def lemmatize_stemming(text):
#     return stemmer.stem(text)

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
doc_sample = documents[documents['index'] == 4310].values[0][0]
print('original document:')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))
documents['Keyword'] = documents['Keyword'].astype(str)
processed_docs = documents['Keyword'].map(preprocess)
processed_docs[:]
dictionary = gensim.corpora.Dictionary(processed_docs)
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break
# Create the BoW corpus, recording which words appear and how many times
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus

# This is where I am stuck: I want each freq here multiplied by the keyword's
# original Frequency from the dataset, but I don't know how to express that
[[(dictionary[id], freq) for id, freq in cp] for cp in bow_corpus[:1]]
print(bow_corpus)
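The multiplication I am after can be sketched on a toy corpus without gensim (a minimal sketch; `token2id` and `doc2bow` below are hypothetical stand-ins for gensim's `Dictionary` and `doc2bow`, and it assumes `processed_docs` and the `Frequency` column share the same row order):

```python
from collections import Counter

# Toy data standing in for processed_docs and the original Frequency column
processed_docs = [['dermatologist'], ['pediatrician'], ['pediatrician']]
frequencies = [1151, 678, 622]

# Build a word -> id mapping (stand-in for gensim.corpora.Dictionary)
token2id = {}
for doc in processed_docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def doc2bow(doc):
    # (token_id, in-document count) pairs, like gensim's doc2bow
    counts = Counter(token2id[tok] for tok in doc)
    return sorted(counts.items())

# Scale each in-document count by the keyword's dataset frequency
weighted_corpus = [
    [(tid, cnt * freq) for tid, cnt in doc2bow(doc)]
    for doc, freq in zip(processed_docs, frequencies)
]
print(weighted_corpus)  # → [[(0, 1151)], [(1, 678)], [(1, 622)]]
```

With real gensim objects the same zip-and-multiply pattern would apply to `dictionary.doc2bow(doc)` and `documents['Frequency']`, but I am not sure this is the right way to feed weights into LDA.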
I am not sure what the best way is to prepare these keywords so that they cluster well and yield useful topics. Please advise.