List of vectors in R: extracting elements of the vectors

2024-06-02 • Q&A

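I have a list containing some texts, so each element of the list is a text. A text is a vector of words, so I have a list of vectors. I am doing some text mining on it. I am now trying to extract the word that follows the word "no". I transformed the vectors, so they are now vectors of two-word strings, like this: list(c("want friend","friend funny","funny nice","nice glad","glad become","become no","no more","more guys"),c("no comfort","comfort written","written conduct","conduct prevent","prevent manners","matters no","no one","one want","want be","be fired"))

My goal is to get a list of vectors like this: list(c("more"),c("comfort","one")) so that I can look up the result vector for each text with liste[i].

I have a formula that extracts the word after "no" (in the first vector, that is "more"). But it does not work when a text contains several occurrences of "no".

Here is my code: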

liste_negation <- vector(length = length(data))
for (i in 1:length(data)){
  for (j in 1:length(data[[i]])){
    if (startsWith((data[[i]])[[j]],'no') == TRUE){
      liste_neg[i] <- c(liste_neg[i],tail(strsplit((data[[i]])[[j]],split=" ")[[1]],1))
    } else{
      liste_neg[i] <- c(liste_neg[i])
    }
    liste_negation[[i]] <- c(liste_neg[[i]])
  }
}

It works for a vector when there is only one "no":

data <- list(c("want friend","be fired"))
data

liste_neg <- c()
liste_negation <- vector(length = length(data))
if (startsWith((data[[1]])[[9]],'no') == TRUE){
  liste_neg[1] <- c(liste_neg[1],tail(strsplit((data[[1]])[[9]],split=" ")[[1]],1))
}

liste_negation[[1]] <- c(liste_neg[[1]])

But if I try to adapt it with a loop that looks at every element of the vector, and the text contains more than one "no", it does not work.

Code:

liste_neg <- c()
liste_negation <- vector(length = length(data))
for (j in 1:length(data[[2]])){
  if (startsWith((data[[2]])[[j]],'no') == TRUE){
    liste_neg[2] <- append(liste_neg[2],tail(strsplit((data[[2]])[[j]],split=" ")[[1]],1))
  }
}
liste_neg
liste_negation[[2]] <- c(liste_neg[[2]])
liste_negation

Warning message:

Warning message:
In liste_neg[2] <- append(liste_neg[2],:
  number of items to replace is not a multiple of replacement length
> liste_neg
[1] NA        "comfort"
> liste_negation[[2]] <- c(liste_neg[[2]])
> liste_negation
[1] "FALSE"   "comfort"

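As you can see, I only get the first word after "no" ("comfort") for the second text; "one" is missing.

I tried many things, splitting the code up and running it piece by piece, but after a whole morning I still have not found a solution.

Does anyone have an idea that could help me?

Thanks in advance (sorry for my English, I'm French ^^')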

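h520zh answers: List of vectors in R: extracting elements of the vectors

In base R, we can use sapply to loop over the list and grep to pick out the strings that contain the word "no" (word_vec here appears to be the list called data in the question):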

output <- sapply(word_vec,function(x) sub(".*no","",grep("\\bno\\b",x,value = TRUE)))

#[[1]]
#[1] ""      " more"

#[[2]]
#[1] " comfort" ""         " one"

If you do not want the empty strings, you can remove them to get

sapply(output,function(x) trimws(x[x!= ""]))  
#[[1]]
#[1] "more"

#[[2]]
#[1] "comfort" "one"
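
The two steps above can be combined into one self-contained sketch. It assumes the sample list from the question is stored in `data`; the helper name `words_after_no` is hypothetical and not part of the answer. `lapply` is used instead of `sapply` because `sapply` may simplify the result to a plain vector or matrix when every text yields the same number of words, while `lapply` always returns a list. `perl = TRUE` is an added assumption to make the `\\b` word boundary handling explicit.

```r
# Sample data copied from the question
data <- list(
  c("want friend", "friend funny", "funny nice", "nice glad",
    "glad become", "become no", "no more", "more guys"),
  c("no comfort", "comfort written", "written conduct", "conduct prevent",
    "prevent manners", "matters no", "no one", "one want", "want be", "be fired")
)

# Hypothetical helper: return the words that follow "no" in one text
words_after_no <- function(x) {
  # keep only the pairs that contain "no" as a whole word
  hits <- grep("\\bno\\b", x, value = TRUE, perl = TRUE)
  # drop everything up to and including "no", then trim the leftover space
  rest <- trimws(sub(".*\\bno\\b", "", hits, perl = TRUE))
  # pairs that ended with "no" leave an empty string; remove those
  rest[rest != ""]
}

result <- lapply(data, words_after_no)
print(result)
```

A text without any "no" yields `character(0)` for that list element rather than an error.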

lapply(data,function(x) substr(x[startsWith(x,"no")],4,1000))


[[1]]
[1] "more"

[[2]]
[1] "comfort" "one"
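
One caveat with the `startsWith(x,"no")` test is that it also accepts strings such as "nothing left". A minimal sketch of a stricter variant follows; the vector `x` is made-up illustration data, not taken from the question.

```r
# Made-up example pairs (not from the question)
x <- c("no more", "nothing left", "become no")

# Loose test: any string beginning with the letters "no"
loose <- substr(x[startsWith(x, "no")], 4, 1000)

# Stricter test: require the space after "no", then strip the prefix
strict <- sub("^no ", "", x[startsWith(x, "no ")])

print(loose)
print(strict)
```

With the sample data from the question both variants give the same result, since no word there merely begins with "no".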

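You can use a regular expression with a capture group to get all the strings that match the desired pattern, and then extract only the capture group, like this: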

# regex for strings that start with "no " and have any text after that
r <- '^no (.*)'
lapply(data,function(x) gsub(r,'\\1',regmatches(x,regexpr(r,x))))

#output
[[1]]
[1] "more"

[[2]]
[1] "comfort" "one"

regexpr returns a match object, regmatches extracts the matched strings from it, and gsub with the \\1 replacement keeps only the first captured group.
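
To make the three steps easier to follow, here is a sketch that runs them one at a time on a single short vector. The vector `x` is a shortened, illustrative subset of the second text from the question, and the intermediate names `m`, `matched` and `words` are introduced only for this example.

```r
# Illustrative subset of the second text from the question
x <- c("no comfort", "comfort written", "matters no", "no one")
r <- "^no (.*)"

# Step 1: one match position per element, -1 where the pattern does not match
m <- regexpr(r, x)

# Step 2: keep only the strings that matched
matched <- regmatches(x, m)

# Step 3: replace each whole match with its first capture group
words <- gsub(r, "\\1", matched)

print(as.integer(m))
print(matched)
print(words)
```

Because the pattern is anchored with `^`, a pair such as "matters no" is skipped in step 1 and never reaches `gsub`.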

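Steps to extract the word after "no":

First, use grep(i,pattern = "^no",value = TRUE) to get the strings that start with "no".
Then gsub(pattern = "no ",replacement = "") replaces "no " with "".

What remains is the word after "no".

lapply() walks over the list and applies these steps to each of its elements.
%>%, the pipe operator, keeps the code readable by passing the result of grep() into gsub().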

library(magrittr)   
lapply(data,function(i) grep(i,pattern = "^no",value = TRUE) %>% gsub(pattern = "no ",replacement = ""))
#[[1]]
#[1] "more"
#    
#[[2]]
#[1] "comfort" "one"
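
For comparison, the loop from the question can also be made to work. The warning there comes from `liste_neg[2] <- ...`: `liste_neg` is an atomic vector, so a single position can hold only one value, and the second word is dropped. The sketch below is one possible repair, not the asker's or any answerer's code; it stores the results in a list, and the names `found` and `pair` are introduced here for illustration.

```r
# Sample data copied from the question
data <- list(
  c("want friend", "friend funny", "funny nice", "nice glad",
    "glad become", "become no", "no more", "more guys"),
  c("no comfort", "comfort written", "written conduct", "conduct prevent",
    "prevent manners", "matters no", "no one", "one want", "want be", "be fired")
)

# A list, so that each slot can hold several words
liste_negation <- vector("list", length(data))

for (i in seq_along(data)) {
  found <- character(0)            # words found so far in text i
  for (pair in data[[i]]) {
    if (startsWith(pair, "no ")) {
      # the last word of the pair is the word after "no"
      found <- c(found, tail(strsplit(pair, split = " ")[[1]], 1))
    }
  }
  liste_negation[[i]] <- found     # [[ ]] stores the whole vector in one slot
}

print(liste_negation)
```

`seq_along(data)` is used instead of `1:length(data)` so that the loop is skipped cleanly when the list is empty.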

list r text-mining vector

Link to this article: https://www.f2er.com/3051405.html
