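Extracting a YouTube video description in R with rvest

2024-05-06 • Q&A

I am trying to extract a YouTube video description using rvest. I know it would be easier to just use the API, but the end goal is to become more familiar with rvest rather than just to get the video description. This is what I have done so far: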

library(rvest)

# defining website
page <- "https://www.youtube.com/watch?v=4PqdqWWSHJY"

# setting Xpath
Xp <- '/html/body/div[2]/div[4]/div/div[5]/div[2]/div[2]/div/div[2]/meta[2]'

# getting page
Website <- read_html(page)

# selecting the node the Xpath points to
Description <- html_nodes(Website, xpath = Xp)

# printing description
html_attr(Description, name = "content")

Although this does point to the video description, I do not get the full description, but rather a string that is cut off after a few lines:

[1] "The Conservatives and Labour have been outlining their main pitch to voters. The Prime Minister Boris Johson in his first major speech of the campaign said a..."

The expected output would be the full description:

"The Conservatives and Labour have been outlining their main pitch to voters. The Prime Minister Boris Johnson in his first major speech of the campaign said a Conservative government would unite the country and "level up" the prospects for people with massive investment in health,better infrastructure,more police,and a green revolution. But he said the key issue to solve was Brexit. Meanwhile Labour vowed to outspend the Tories on the NHS in England. 

Labour leader Jeremy Corbyn has also faced questions over his position on allowing a second referendum on Scottish independence. Today at the start of a two-day tour of Scotland,he said wouldn't allow one in the first term of a Labour government but later rowed back saying it wouldn't be a priority in the early years. 

Sophie Raworth presents tonight's BBC News at Ten and unravels the day's events with the BBC's political editor Laura Kuenssberg,health editor Hugh Pym and Scotland editor Sarah Smith.


Please subscribe HERE: LINK"

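Is it possible to get the full description with rvest?

Since you say your focus is on learning, I have added an explanation of how I got there after the code.

Reproducible code: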

library(rvest)
library(magrittr)
url <- "https://www.youtube.com/watch?v=4PqdqWWSHJY"
url %>% 
  read_html %>% 
  html_nodes(xpath = "//*[@id = 'eow-description']") %>% 
  html_text
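
To see offline why this selector returns more than the meta tag does, here is a minimal sketch on an inline HTML fragment. The fragment is a hypothetical stand-in for the downloaded page (the real markup is much larger, and the `name = 'description'` attribute on the meta tag is an assumption made for illustration); only the id `eow-description` comes from the answer itself.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page, not the real YouTube markup.
html <- '<html><head>
<meta name="description" content="The Conservatives and Labour have been a...">
</head><body>
<p id="eow-description" class="">The Conservatives and Labour have been outlining their main pitch to voters.</p>
</body></html>'

doc <- read_html(html)

# The meta tag only carries the shortened text in its content attribute.
short <- doc %>%
  html_nodes(xpath = "//meta[@name = 'description']") %>%
  html_attr("content")

# The element with id eow-description carries the full text as its body.
full <- doc %>%
  html_nodes(xpath = "//*[@id = 'eow-description']") %>%
  html_text()

print(short)
print(full)
```

Selecting by id rather than by an absolute path such as /html/body/div[2]/... also tends to survive small layout changes better, since it does not depend on the position of every ancestor element.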

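Explanation:

1. Locate the element

There are several ways to approach this. The usual first step is to right-click the target element in the browser and choose "Inspect element". The inspector then shows the markup around the description, which here appears to sit in an element whose id is description.

Next, you can try to extract the data.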

url %>% 
  read_html %>% 
  html_nodes(xpath = "//*[@id = 'description']")

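Unfortunately, this does not work in your case: the query does not find the description in the document that rvest downloads.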
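
A likely reason is that the inspector shows the page after the browser has run its scripts, while read_html only parses the raw HTML and does not execute JavaScript. A query for an id that is missing from the raw HTML does not raise an error; it returns an empty node set, which is easy to overlook. The sketch below shows this on a hypothetical fragment (the markup is an assumption, not the real page):

```r
library(rvest)

# Hypothetical fragment standing in for the raw HTML that read_html() receives.
doc <- read_html('<html><body><p id="eow-description">Full text</p></body></html>')

# The id seen in the browser inspector is not present in this document.
hits <- doc %>% html_nodes(xpath = "//*[@id = 'description']")

# An empty node set has length 0, so it can be checked explicitly.
if (length(hits) == 0) {
  message("No node matched; the id may only exist in the rendered page.")
}
```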

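2. Make sure you have the right source

So you have to make sure that the target data is in the document that was loaded. You can see this in your browser's network activity, or, if you want to check it in R, I wrote a small function for that: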

# write the parsed document to a temporary file and open it in the RStudio viewer
showHtmlPage <- function(doc){
  tmp <- tempfile(fileext = ".html")
  doc %>% toString %>% writeLines(con = tmp)
  tmp %>% browseURL(browser = rstudioapi::viewer)
}

Usage:

url %>% read_html %>% showHtmlPage
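
If no viewer is available, the same check can be sketched as a plain text search over the serialized document. The helper name containsText is hypothetical and not part of rvest, and the inline fragment is a stand-in for the downloaded page:

```r
library(rvest)

# Hypothetical helper (not part of rvest): TRUE if `snippet` occurs verbatim
# in the serialized document.
containsText <- function(doc, snippet) {
  grepl(snippet, as.character(doc), fixed = TRUE)
}

# Stand-in for url %>% read_html, so the example runs without network access.
doc <- read_html('<html><body><p id="eow-description">The Conservatives and Labour have been outlining</p></body></html>')

found <- containsText(doc, "The Conservatives and")
missing <- containsText(doc, "text that is not in the page")
print(c(found = found, missing = missing))
```

With fixed = TRUE the snippet is matched literally, so characters such as dots or parentheses in the description are not interpreted as a regular expression.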

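You will see that the target data is in fact in the document you downloaded, so you can stick with rvest. Next you have to find the xpath (or css), ...

3. Find the target tag in the downloaded document

You can search for the tag that contains the text you are looking for: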

doc <- url %>% read_html
doc %>% html_nodes(xpath = "//*[contains(text(),'The Conservatives and ')]")

The output will be:

{xml_nodeset (1)}
[1] <p id="eow-description" class="">The Conservatives and Labour have ....

There you can see that the tag you are looking for has the id eow-description.
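
The last step, going from the text search to a reusable selector, can be sketched as follows. The inline fragment is again a hypothetical stand-in for the downloaded page. The CSS form #eow-description is shown as an alternative to the xpath and is not taken from the original answer.

```r
library(rvest)

# Stand-in for the downloaded page, so the example runs without network access.
doc <- read_html('<html><body><div><p id="eow-description" class="">The Conservatives and Labour have been outlining</p></div></body></html>')

# Step 1: find the node by a piece of its text, then read its id attribute.
hit <- doc %>% html_nodes(xpath = "//*[contains(text(), 'The Conservatives and ')]")
id <- html_attr(hit, "id")

# Step 2: use the id as the selector, either as xpath or as css.
by_xpath <- doc %>% html_nodes(xpath = sprintf("//*[@id = '%s']", id)) %>% html_text()
by_css <- doc %>% html_nodes(css = paste0("#", id)) %>% html_text()

print(id)
print(identical(by_xpath, by_css))
```

This assumes the text search matches exactly one node; if several nodes matched, id would be a vector and the selectors would have to be built one at a time.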

cjh0971 answered: Extracting a YouTube video description in R with rvest