Web scraping a span using rvest

2024-05-04 • Q&A

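I want to extract text from the page https://www.sec.gov/ix?doc=/Archives/edgar/data/918160/000091816018000065/form10-k2017.htm. I am looking at the section headed "Opinion on the Financial Statements", and I only need the paragraph that contains the words "accompanying consolidated". If there is a match, it should return all of the text starting with "We have audited.....". I want to write the result to a text file. I have tried other options but could not find the right code to get this text. Can anyone help me with this?

Below is the code I used to extract the information, but I get an empty string.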

library(rvest)

sample_url="https://www.sec.gov/ix?doc=/Archives/edgar/data/918160/000091816018000065/form10-k2017.htm"

cont<- read_html(sample_url)

output= gsub('\r\n',' ',html_nodes(cont,'p') %>% html_text())

text=output[grepl("accompanying consolidated",output)]

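If you refresh the page with the Network tab open, you will see an alternative source for the content of interest; note that it returns an XBRL document. I would consider using XPath rather than a regular expression to match the span containing that text and then take its parent div, since that is how the content is actually laid out on the page. Then, when extracting the text, check whether the result is NA.

R: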
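As a small illustration of the "alternative source" point: in the question's URL, the path after `ix?doc=` is the same path that the answer's code requests directly. The sketch below recovers that path from the viewer link using only the standard library. The function name `direct_document_url` is hypothetical, and the assumption that the viewer always carries the document path in a `doc` query parameter is based only on the one URL shown here.

```python
from urllib.parse import urlsplit, parse_qs, urljoin


def direct_document_url(viewer_url):
    """Return the document URL named in the viewer link's `doc` parameter.

    Assumption: the viewer URL has the form .../ix?doc=/Archives/...
    If there is no `doc` parameter, the input is returned unchanged.
    """
    parts = urlsplit(viewer_url)
    doc_values = parse_qs(parts.query).get("doc")
    if not doc_values:
        return viewer_url
    # Join the site root with the document path taken from the query string.
    return urljoin(f"{parts.scheme}://{parts.netloc}", doc_values[0])


viewer = ("https://www.sec.gov/ix?doc=/Archives/edgar/data/918160/"
          "000091816018000065/form10-k2017.htm")
print(direct_document_url(viewer))
```

Requesting the resulting URL, rather than the viewer page, is what lets a static parser such as rvest or bs4 see the paragraph text at all.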

library(rvest)
library(magrittr)

node_text <- read_html('https://www.sec.gov/Archives/edgar/data/918160/000091816018000065/form10-k2017.htm')%>%
     html_node(xpath="//span[contains(text(),'accompanying consolidated')]/parent::div")%>%
     html_text()
result <- ifelse(is.na(node_text),'not found',node_text)
result

Py (bs4 4.7.1+):

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.sec.gov/Archives/edgar/data/918160/000091816018000065/form10-k2017.htm')
soup = bs(r.content,'lxml')
target = soup.select_one('div:has(span:contains("accompanying consolidated"))')
if target is None:
    print('Not found')
else:
    print(target.text)

Both were tested before answering.
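The question also asks for the text from "We have audited" onward and for the result to be written to a text file, which the snippets above do not cover. A minimal sketch of that last step follows, assuming the paragraph text has already been extracted as a string (or is `None` when nothing matched). The helper names `extract_from_phrase` and `save_result`, the output file name `opinion.txt`, and the sample sentence are all hypothetical.

```python
from pathlib import Path


def extract_from_phrase(text, keyword="accompanying consolidated",
                        start_phrase="We have audited"):
    """Return text from start_phrase onward if keyword occurs, else None."""
    if text is None or keyword not in text:
        return None
    start = text.find(start_phrase)
    # If the start phrase is absent, fall back to the whole paragraph.
    return text[start:] if start != -1 else text


def save_result(text, path="opinion.txt"):
    """Write the extracted text, or 'not found', to a UTF-8 text file."""
    Path(path).write_text(text if text is not None else "not found",
                          encoding="utf-8")


if __name__ == "__main__":
    # Made-up sample paragraph, standing in for the scraped div text.
    sample = ("Opinion on the Financial Statements We have audited the "
              "accompanying consolidated balance sheets of Example Corp.")
    save_result(extract_from_phrase(sample))
```

In practice the `sample` string would be replaced by `target.text` from the Python snippet (or `None` when `target` is `None`).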

