Displaying word frequencies from a dictionary based on an LDA topic model

2024-05-19 • Q&A

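I'm new to text mining in R. I've just started a project that determines word frequencies in one corpus based on a topic model that was generated by running the LDA algorithm on a different corpus.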

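From corpus A I obtained the top 100 terms for each of 5 topics, and I want to find out how often these terms occur in corpus B. I created 5 data frames, one per topic, each holding that topic's top 100 terms. However, when I build the dtm for corpus B I get the following error: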

Error in tdm(txt,isTRUE(control$removePunctuation),isTRUE(control$removeNumbers),: 
  Expecting a string vector: [type=list; required=STRSXP].

Does anyone know how to save each topic's top terms as a dictionary and then build the dtm for corpus B based on those dictionaries?

I think my current code makes the procedure easier to follow:

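library(tm)          # Corpus, VectorSource, DocumentTermMatrix
library(topicmodels) # LDA
library(tidytext)    # tidy() for LDA models
library(dplyr)       # %>%, group_by, top_n, arrange

corpusA <- Corpus(VectorSource(A))
corpusB <- Corpus(VectorSource(B))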

#Preprocessing#

dtmA<- DocumentTermMatrix(corpusA,control=list(wordLengths=c(3,28),bounds = list(global = c(5,Inf))))
ap_lda <- LDA(dtmA,k=5,control = list(seed=1234))
ap_lda

ap_topics <- tidy(ap_lda,matrix = "beta")
ap_topics

ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  top_n(100,beta) %>%
  ungroup() %>%
  arrange(topic,-beta)

terms_dataframe <- as.data.frame(ap_top_terms)
topic1 <- terms_dataframe[ terms_dataframe$topic == 1,]
topic1_words <- topic1[2] #current data frame now only consists of the topic terms
#same procedure for the other 4 topics 

dtm_topic1 <- DocumentTermMatrix(corpusB,control=list(dictionary = topic1_words))

Maybe there is a better way to solve my "problem". I would be very glad if someone could help :)
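
A likely cause of the error, judging from the message: `topic1[2]` returns a one-column data frame, which R treats as a list, while the `dictionary` control option expects a character vector of terms. Below is a minimal sketch of one way around it, assuming the term column is called `term` as in the `tidy()` output. The small `terms_dataframe` and `B` are made-up stand-ins for the real data, and `topic_dicts` and `topic_freqs` are illustrative names that do not appear in the question.

```r
library(tm)

# Hypothetical stand-ins for the question's objects: a tiny table shaped like
# ap_top_terms (topic, term, beta) and a three-document corpus B.
terms_dataframe <- data.frame(
  topic = c(1L, 1L, 2L, 2L),
  term  = c("market", "price", "court", "judge"),
  beta  = c(0.05, 0.03, 0.04, 0.02),
  stringsAsFactors = FALSE
)
B <- c("the market price rose as the market opened",
       "the judge left the court",
       "nothing relevant here")
corpusB <- Corpus(VectorSource(B))

# One character vector of terms per topic, instead of one data frame per topic.
topic_dicts <- split(terms_dataframe$term, terms_dataframe$topic)

# For each topic, restrict the dtm to that topic's terms and sum the counts
# over all documents in corpus B.
topic_freqs <- lapply(topic_dicts, function(words) {
  dtm <- DocumentTermMatrix(corpusB, control = list(dictionary = words))
  sort(colSums(as.matrix(dtm)), decreasing = TRUE)
})

topic_freqs[["1"]]  # named vector of term counts for topic 1
```

For a single topic, the equivalent change to the code in the question would be `topic1_words <- topic1$term` rather than `topic1[2]`. Note that `as.matrix()` produces a dense matrix, which may get large for a big corpus B.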
