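Scraping dynamic content with Python

2024-05-07 • Q&A

I am trying to scrape a specific number from the following URL: "https://www.ulb.uni-muenster.de/". The number is dynamic. Unfortunately, when I search for it I only get the element's class, not the number. When I inspect the page in Chrome, the number is clearly visible in the source. I have tried two approaches: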

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.ulb.uni-muenster.de/'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
# Look up the span that should hold the seat counter
tags = soup.find('span', {'class': 'seatingsCounter'})
print(tags)

Output: <span class="seatingsCounter"></span>

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.ulb.uni-muenster.de/')
data = BeautifulSoup(r.content, 'lxml')
examples = []
for d in data.find_all('a'):
    examples.append(d)
# Search the document parsed from this response
my_as = data.find_all("span", {"class": "seatingsCounter"})
print(my_as)

Output: [<span class="seatingsCounter"></span>]

Neither approach works: the output is always just the empty span with its class, never the number.

If you look at the page source, you will see that the number of free seats is filled in by the JavaScript function showMessage:

var showMessage = function(data) {
                var locations = [ "ZB_LS","ZB_RS" ];
                var free = 0;
                var total = 0;
                var open = true;
                $('.availableSeatings .spinner').remove();
                $('.availableSeatings .error').data('counter',0);
                $.each(data.locations,function( key,value ) {
                    if ($.inArray( value.id,locations) !== -1)
                    {
                        free = free + Math.round((100 - value.quota) * value.places/100);
                        total = total + value.places;
                        open = open && value.open;
                    }
                });

                if (open)
                {
                    $('.availableSeatings .message').show().siblings().hide();
                    quota = Math.round(free/total * 100);
                    result = free + '<span class="quota">(' + quota + '%)</span>';
                    date = $.format.date(data.datetime,"dd.MM.yyyy,HH:mm");
                    $('.availableSeatings .seatingsCounter').html(result);  // <- HERE!!
                    $('.availableSeatings .updated .datetime').text(date);
                    $('.availableSeatings .updated').show();
                } else {
                    $('.availableSeatings .closed').show().siblings().hide();
                }
        };
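
The arithmetic in showMessage can be sketched in Python as follows. This is only an illustrative port, not code from the site: the names js_round and summarize are made up here, the two sample entries are copied from the JSON payload shown later in this answer, and reading quota as "percent occupied" is an assumption based on the formula above.

```python
import math

# Location ids that showMessage adds up.
READING_ROOMS = ("ZB_LS", "ZB_RS")


def js_round(x):
    """Round half up like JavaScript's Math.round (Python's round() rounds half to even)."""
    return math.floor(x + 0.5)


def summarize(data, wanted=READING_ROOMS):
    """Return (free, total, is_open) following the logic of showMessage (hypothetical helper)."""
    free = 0
    total = 0
    is_open = True
    for loc in data["locations"]:
        if loc["id"] in wanted:
            # Assumption: quota is the occupied share in percent.
            free += js_round((100 - loc["quota"]) * loc["places"] / 100)
            total += loc["places"]
            # A missing "open" key is undefined, hence falsy, in the JavaScript version.
            is_open = is_open and bool(loc.get("open", False))
    return free, total, is_open


# Two entries copied from the sample payload in this answer.
sample = {
    "locations": [
        {"id": "ZB_LS", "open": True, "quota": 99, "places": 678},
        {"id": "ZB_RS", "quota": 94, "places": 154},
    ]
}

free, total, is_open = summarize(sample)
print(free, total, js_round(free / total * 100), is_open)
```

Note that the second sample entry carries no "open" key, so the port reports the rooms as not open, just as the JavaScript condition would for that payload.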

Further down in the source, you will find these lines:

$.ajax({
            dataType: "json",url: "/available-seatings.json", // <-- THIS LOOKS INTERESTING
            timeout: 40000,success: function(data) { showMessage(data); },error: function() {
                counter = $('.availableSeatings .error').data('counter');
                if (isNaN(counter) || counter >= 3)
                {
                    showError();
                } else {
                    $('.availableSeatings .error').data('counter',counter + 1);
                }
            },complete: function() {
              setTimeout(worker,60000);
            }
          });

If we go to https://www.ulb.uni-muenster.de/available-seatings.json, we see something like this:

{"datetime":"2019-11-13 13:49:46","locations":[{"id":"ZB_LS","label":"Zentralbibliothek Lesesaal","open":true,"quota":99,"places":678},{"id":"ZB_RS","label":"Zentralbibliothek Recherchesaal","quota":94,"places":154},{"id":"VSTH","label":"Bibliothek im Vom-Stein-Haus","quota":56,"places":145},{"id":"RWS1","label":"Bibliothek im Rechtswissenschaftlichen Seminar I \/ Einzelarbeitszone","quota":98,"places":352},{"id":"RWS1_G","label":"Bibliothek im Rechtswissenschaftlichen Seminar I \/ Gruppenarbeitszone","quota":30,"places":40},{"id":"RWS2","label":"Bibliothek im Rechtswissenschaftlichen Seminar II","quota":54,"places":162},{"id":"WIWI","label":"Fachbereichsbibliothek Wirtschaftswissenschaften \/ Einzelarbeitszone","quota":71,"places":132},{"id":"WIWI_G","label":"Fachbereichsbibliothek Wirtschaftswissenschaften \/ Gruppenarbeitszone","places":45},{"id":"ZBSOZ","label":"Zweigbibliothek Sozialwissenschaften","quota":74,"places":129},{"id":"FHAUS","label":"Gemeinschaftsbibliothek im F\u00fcrstenberghaus","quota":68,"places":197},{"id":"IFE","label":"Bibliothek des Instituts f\u00fcr Erziehungswissenschaft","quota":47,"places":183},{"id":"PHI","label":"Bibliotheken im Philosophikum (Domplatz 23)","places":98}]}

Voilà: reading this with Python's json module is probably much easier than rewriting the scraper with Selenium, although that would work too.
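
A minimal sketch of that approach, using only the standard library, might look like this. The function names fetch_seatings and free_seats_by_location are hypothetical, the timeout value is arbitrary, and the code assumes the endpoint still returns JSON in the shape shown above, which may have changed since this answer was written.

```python
import json
import math
from urllib.request import urlopen

# Endpoint from the page's $.ajax call, resolved against the site root.
SEATINGS_URL = "https://www.ulb.uni-muenster.de/available-seatings.json"


def fetch_seatings(url=SEATINGS_URL, timeout=40):
    """Download the JSON document and decode it into Python objects."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def free_seats_by_location(data):
    """Map each location id to its estimated number of free seats."""
    result = {}
    for loc in data["locations"]:
        # Some entries in the sample have no "quota"; skip them instead of guessing.
        if "quota" not in loc:
            continue
        free = (100 - loc["quota"]) * loc["places"] / 100
        # Round half up, matching JavaScript's Math.round for these values.
        result[loc["id"]] = math.floor(free + 0.5)
    return result
```

Calling free_seats_by_location(fetch_seatings()) would then give the per-location counts, and adding up the entries for "ZB_LS" and "ZB_RS" reproduces the number the page displays.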
