Using Scrapy selectors with conditions

2024-05-12 • Q&A

I am using Scrapy to scrape a few articles, like this one: https://fivethirtyeight.com/features/championships-arent-won-on-paper-but-what-if-they-were/
I am using the following code in my spider:

    def parse_article(self, response):
        il = ItemLoader(item=Scrapping538Item(), response=response)
        il.add_css('article_text', '.entry-content *::text')

...which works. But I would like to make this CSS selector more sophisticated. Right now I am extracting every text passage, but the article also contains tables and visualizations that include text of their own. The HTML structure looks like this:

    <div class="entry-content single-post-content">
      <p>text I want</p>
      <p>text I want</p>
      <p>text I want</p>
      <section class="viz">
        <header class="viz">
          <h5 class="title">TITLE-text</h5>
          <p class="subtitle">SUB-TITLE-text</p>
        </header>
        <table class="viz full">TABLE DATA</table>
      </section>
      <p>text I want</p>
      <p>text I want</p>
    </div>

With the code snippet above, I get something like:

    text I want
    text I want
    text I want
    TITLE-text   SUB-TITLE-text   TABLE DATA   text I want
    text I want

My questions:

How can I modify the add_css() call so that it takes all the text except the text from the table?
Would it be easier with a different loader method, such as add_xpath(), instead of add_css()?
In general, what is the best practice for this (extracting text under conditions)?

Thank you very much for your feedback.

You can get the desired output with XPath and the ancestor axis:

'//*[contains(@class,"entry-content")]//text()[not(ancestor::*[@class="viz"])]'

Unless I'm missing something crucial, the following XPath should work:

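import scrapy
import w3lib.html  # import the submodule explicitly so w3lib.html.remove_tags resolves

# Select only the <p> elements that are direct children of the content <div>
raw = response.xpath(
    '//div[contains(@class,"entry-content") '
    'and contains(@class,"single-post-content")]/p'
).extract()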

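This omits the table content and yields only the paragraphs, including any links inside them, as a list. But there is a catch: since we did not use /text(), the <p> and <a> tags are still present. Let's remove them: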

cleaned = [w3lib.html.remove_tags(block) for block in raw]

Use > in the CSS expression to limit it to children (direct descendants):

.entry-content > *::text
