I have this HTML:
<div>
<div data-id="1"> </div>
<div data-id="2"> </div>
<div data-id="3"> </div>
...
<div> </div>
</div>
I need to select only the inner div elements that have a data-id attribute (regardless of its value). How can I do this with Scrapy?
You can use the following:
response.css('div[data-id]').extract()
It will give you a list of all the divs that have a data-id attribute.
[u'<div data-id="1"> </div>',u'<div data-id="2"> </div>',u'<div data-id="3"> </div>']
Alternatively, use BeautifulSoup. Code:
from bs4 import BeautifulSoup
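# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup("""<div> <div data-id="1"> </div> <div data-id="2"> </div> <div data-id="3"> </div><div> </div> </div>""", "html.parser")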
print(soup.find_all("div",{"data-id":True}))
Output:
[<div data-id="1"> </div>,<div data-id="2"> </div>,<div data-id="3"> </div>]
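In find or find_all you can pass an attribute with the value True to match every element on which that attribute is present, whatever its value.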
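A short sketch of the True filter next to two equivalent spellings, assuming bs4 (with its bundled CSS selector support) is installed. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = """<div> <div data-id="1"> </div> <div data-id="2"> </div> <div data-id="3"> </div><div> </div> </div>"""
soup = BeautifulSoup(html, "html.parser")

# True means "the attribute exists", so any value is accepted.
with_attr = soup.find_all("div", attrs={"data-id": True})

# find() takes the same filter but returns only the first match.
first = soup.find("div", attrs={"data-id": True})

# A CSS attribute selector is another way to test for presence.
via_css = soup.select("div[data-id]")

ids = [tag["data-id"] for tag in with_attr]
css_ids = [tag["data-id"] for tag in via_css]
print(ids, first["data-id"], css_ids)
```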
<li class="gb_i" aria-grabbed="false">
<a class="gb_d" data-pid="192" draggable="false" href="xyz.com" id="gb192">
<div data-class="gb_u"></div>
<div data-class="gb_v"></div>
<div data-class="gb_w"></div>
<div data-class="gb_x"></div>
</a>
</li>
Take a look at the HTML sample above. To get all the divs that have a data-class attribute in Scrapy v1.6+:
# [@data-class] is true only when the attribute is present, whatever its value.
response.xpath('//a[@data-pid="192"]/div[@data-class]').getall()
In older Scrapy versions, use .extract() in place of .getall().
In the Scrapy shell:
In [1]: b = '''
...: <div>
...: <div data-id="1">gdfg </div>
...: <div data-id="2">dgdfg </div>
...: <div data-id="3">asdasd </div>
...: <div> </div>
...: </div>
...: '''
In [2]: from scrapy import Selector
In [3]: sel = Selector(text=b,type="html")
In [4]: sel.xpath(r'//div[re:test(@data-id,"\d")]/text()').extract()
Out[4]: ['gdfg ', 'dgdfg ', 'asdasd ']