Getting text from nested elements of HTML with BeautifulSoup in Python

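I am trying to extract the teams playing each day, along with the active and inactive players on each team's roster. The URL of the page I want to scrape is: https://stats.nba.com/lineups/. I have been using BeautifulSoup and have tried several approaches, but I can't seem to extract anything from inside:

<div class="landing__flex-col lineups-game" data-game-state="3" nba-data-game="game" nba-with ng-include ng-repeat="game in games" src="'/lineups-template.html'">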

I want to get the teams of each game in

<div class="landing__flex-col lineups-game" data-game-state="3" nba-data-game="game" nba-with ng-include ng-repeat="game in games" src="'/lineups-template.html'">

and each player in

<div class="columns small-6 lineups-game__team lineups-game__team--htm" nba-with nba-with-data-team="game.h" ng-include src="'/lineups-team-template.html'">

So, in the HTML sample below, I want to get the text MEM, CHA, J. Valanciunas, and J. Crowder, and eventually do this for every player on every team.

<div class="landing__flex-row lineups-games" ng-show="isLoaded &amp;&amp; hasData" aria-hidden="false">
          <!----><!----><div class="landing__flex-col lineups-game" ng-repeat="game in games" nba-with="" nba-data-game="game" data-game-state="3" ng-include="" src="'/lineups-template.html'">
  <div class="lineups-game__inner row">

    <div class="columns small-12 lineups-game__title">
      <a href="/game/0021900154/">
        <span class="lineups-game__team-name">MEM</span>
        <span class="lineups-game__vs">vs</span>
        <span class="lineups-game__team-name">CHA</span>
        <span class="lineups-game__status hide-for-live-game">Final</span>
        <span class="lineups-game__status hide-for-pre-game hide-for-post-game">Live</span>
      </a>
    </div>

    <!----><div class="columns small-6 lineups-game__team lineups-game__team--vtm" nba-with="" nba-with-data-team="game.v" ng-include="" src="'/lineups-team-template.html'">

  <!----><!----><div ng-if="team.hasBench" nba-with="" nba-with-data-team="team" ng-include="" src="'/lineups-confirmed-roster-template.html'">
  <div class="lineups-game__header">
    <img team-logo="" class="lineups-game__team-logo team-img" abbr="MEM" type="image/svg+xml" src="/media/img/teams/logos/MEM_logo.svg" alt="Memphis Grizzlies logo" title="Memphis Grizzlies logo">
    <span class="lineups-game__team-name">MEM</span>
  </div>

  <div class="lineups-game__roster-type lineups-game__roster-type--confirmed">active List</div>

  <ul class="lineups-game__roster lineups-game__roster--official">
    <!----><li class="lineups-game__player lineups-game__player--starter" ng-repeat="pl in team.starters">
      <a href="/player/202685/">
        <span class="lineups-game__pos">C</span>
        <span class="lineups-game__name">J. Valanciunas</span>
      </a>
    </li><!----><li class="lineups-game__player lineups-game__player--starter" ng-repeat="pl in team.starters">
      <a href="/player/203109/">
        <span class="lineups-game__pos">SF</span>
        <span class="lineups-game__name">J. Crowder</span>
      </a>

Among other approaches, I tried the following, to no avail:

import urllib.request
import bs4 as bs

gamesSource = urllib.request.urlopen('https://stats.nba.com/lineups/').read()
gamesSoup = bs.BeautifulSoup(gamesSource,'html.parser')

teams = gamesSoup.find_all("span",{"class":"lineups-game__teams-name"})

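All that comes back is an empty list, and when I try to fetch one specific "span" line, all I get back is "None".

Please let me know what is going wrong and how I can access the information I am after.

Thanks.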


gaoshao1982 answered: Getting text from nested elements of HTML with BeautifulSoup in Python

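Unfortunately, you can't do this with urllib. The site in question uses JS to call an API and populate the data after the initial page load.

urllib only downloads the initial file served by the server; it cannot handle anything the page goes on to do after its initial render in a browser.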

Because the actual HTML you download through urllib.request does not yet have the lineups-game__teams-name elements populated, your teams = gamesSoup.find_all("span",{"class":"lineups-game__teams-name"}) call returns an empty list.
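To make the point concrete, here is a minimal sketch that runs the same kind of class lookup against two strings: a hypothetical stand-in for the unrendered page (an assumption, not the actual response from the site) and a trimmed copy of the rendered markup from the question. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages; the helper names are made up for illustration. It also suggests a second thing worth checking: the sample markup uses the class lineups-game__team-name, while the find_all call in the question asks for lineups-game__teams-name.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text of each <span> whose class attribute contains wanted_class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.texts = []
        self._buffer = None  # holds text chunks while inside a matching span

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            if self.wanted_class in classes:
                self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._buffer is not None:
            self.texts.append("".join(self._buffer).strip())
            self._buffer = None


def span_texts(html, wanted_class):
    """Return the text of every matching span in the given HTML string."""
    parser = SpanTextCollector(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.texts


# Hypothetical stand-in for what urllib receives: the container is there,
# but the JS on the page has not filled it in yet.
UNRENDERED = (
    '<div class="landing__flex-row lineups-games" '
    'ng-show="isLoaded &amp;&amp; hasData"></div>'
)

# Trimmed from the rendered sample in the question.
RENDERED = """
<a href="/game/0021900154/">
  <span class="lineups-game__team-name">MEM</span>
  <span class="lineups-game__vs">vs</span>
  <span class="lineups-game__team-name">CHA</span>
</a>
<li class="lineups-game__player lineups-game__player--starter">
  <a href="/player/202685/">
    <span class="lineups-game__pos">C</span>
    <span class="lineups-game__name">J. Valanciunas</span>
  </a>
</li>
<li class="lineups-game__player lineups-game__player--starter">
  <a href="/player/203109/">
    <span class="lineups-game__pos">SF</span>
    <span class="lineups-game__name">J. Crowder</span>
  </a>
</li>
"""

print(span_texts(UNRENDERED, "lineups-game__team-name"))
print(span_texts(RENDERED, "lineups-game__team-name"))
print(span_texts(RENDERED, "lineups-game__name"))
print(span_texts(RENDERED, "lineups-game__teams-name"))
```

With BeautifulSoup, the equivalent lookup on rendered HTML would be find_all("span", class_="lineups-game__team-name"); the lookup only succeeds once the markup has actually been populated.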

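You can try inspecting the API calls the site makes after the initial load (check the "Network" tab in your browser's developer tools) and see whether you can find the source of the data you need. If you are lucky, you may be able to get the data through an API call directly. Because the page makes a large number of external requests (for images and other media), you can tick XHR so that the network list shows only the remote API calls.

If you can't find an API, or the external calls are blocked, you can try using a JS-enabled browser driven from Python (i.e. selenium) to download the page with its JS code executed.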

,

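Piggybacking off what has already been said: because this page is generated through API/JS calls, you will need a different scraping library. I usually go with selenium. The code below pulls all the teams and rosters and pairs them up. There may be some quirks in this code, but I think it will point you in the right direction: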

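from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service

desired_link = 'https://stats.nba.com/lineups/'

# Run Firefox without opening a window.
fire_opts = webdriver.FirefoxOptions()
fire_opts.add_argument("-headless")
fire_path = 'geckodriver.exe'  # path to the geckodriver binary
# Selenium 4 takes the driver path through a Service object
# (the executable_path keyword of webdriver.Firefox was removed).
driver = webdriver.Firefox(options=fire_opts, service=Service(executable_path=fire_path))
driver.get(desired_link)

# Selenium 4 replaced find_elements_by_class_name with find_elements(By.CLASS_NAME, ...).
team_names_list = driver.find_elements(By.CLASS_NAME, 'lineups-game__team-name')
team_names = []
for name in team_names_list:
    team_names.append(name.text)

starting_lineup_list = driver.find_elements(By.CLASS_NAME, 'lineups-game__roster--projected')
starting_lineup = []
for lineup in starting_lineup_list:
    starting_lineup.append(lineup.text)

driver.quit()

# Pair each team name with the roster text found in the same position.
for teams, players in zip(team_names, starting_lineup):
    print(teams, players)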

This should output every team on the page, like so:

DET PG D. Rose
SG L. Kennard
SF T. Snell
PF B. Griffin
C A. Drummond

The formatting could be nicer, but you can drop it into a spreadsheet (or whatever you like) for use...
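As one possible way to get this into a spreadsheet, the sketch below splits each roster's text into one row per player and writes CSV. It assumes each roster element's text is one "POS Name" pair per line, as in the output shown above; the function names roster_rows and to_csv are hypothetical helpers, and the sample pair simply repeats the DET output.

```python
import csv
import io


def roster_rows(team, roster_text):
    """Split one team's roster text (one 'POS Name' per line) into (team, pos, name) rows."""
    rows = []
    for line in roster_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The position is the first token; the rest of the line is the player name.
        pos, _, name = line.partition(" ")
        rows.append((team, pos, name.strip()))
    return rows


def to_csv(pairs):
    """Render (team, roster_text) pairs as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["team", "pos", "player"])
    for team, roster_text in pairs:
        writer.writerows(roster_rows(team, roster_text))
    return buf.getvalue()


# Sample input copied from the output above; in the scraper this would be
# zip(team_names, starting_lineup).
sample_pairs = [
    ("DET", "PG D. Rose\nSG L. Kennard\nSF T. Snell\nPF B. Griffin\nC A. Drummond"),
]
csv_text = to_csv(sample_pairs)
print(csv_text)
```

Writing csv_text to a file with a .csv extension should be enough for most spreadsheet programs to open it.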

,

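You can get it by calling the API directly. Just change the date parameter dynamically. Here is an example; you will need to iterate over the games/indexes, or flatten the JSON and rebuild it into a DataFrame: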

import pandas as pd
import requests

url = 'https://stats.nba.com/js/data/dailylineups/2019/daily_lineups_20191118.json'
jsonData = requests.get(url).json()

print(pd.DataFrame(jsonData['results'][0]['LAC']))

Output:

  firstName  lastName playerId pos rotoId team
0   Patrick  Beverley   201976  PG   3072  LAC
1   Terance      Mann  1629611  SG   4860  LAC
2     Kawhi   Leonard   202695  SF   3195  LAC
3      Paul    George   202331  PF   3114  LAC
4     Ivica     Zubac   162726   C   3888  LAC
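The two steps mentioned above (changing the date and flattening the JSON) could be sketched roughly as follows. This is only an illustration under stated assumptions: the year segment of the path is assumed to be the calendar year of the date, as in the 2019 / 20191118 example; the payload shape is inferred from the single access jsonData['results'][0]['LAC']; the sample payload, the "XYZ" team, and the helper names lineups_url and flatten_lineups are all hypothetical.

```python
from datetime import date


def lineups_url(day):
    """Build the daily lineups URL for a date.

    Assumption: the year in the path is the calendar year of the date.
    """
    return (
        "https://stats.nba.com/js/data/dailylineups/"
        f"{day.year}/daily_lineups_{day:%Y%m%d}.json"
    )


def flatten_lineups(json_data):
    """Return one flat list of player records from every entry in json_data['results']."""
    rows = []
    for index, game in enumerate(json_data.get("results", [])):
        for key, value in game.items():
            # Only values that are lists of dicts are treated as rosters;
            # any other field in the entry is skipped.
            if isinstance(value, list) and all(isinstance(p, dict) for p in value):
                for player in value:
                    rows.append({"result_index": index, "roster_key": key, **player})
    return rows


# Hypothetical payload: the two LAC records reuse values from the output above,
# the "XYZ" roster and the "note" field are made up to exercise the function.
sample = {
    "results": [
        {
            "LAC": [
                {"firstName": "Patrick", "lastName": "Beverley", "pos": "PG", "team": "LAC"},
                {"firstName": "Kawhi", "lastName": "Leonard", "pos": "SF", "team": "LAC"},
            ],
            "XYZ": [
                {"firstName": "Example", "lastName": "Player", "pos": "C", "team": "XYZ"},
            ],
            "note": "non-roster field, ignored",
        }
    ]
}

print(lineups_url(date(2019, 11, 18)))
for row in flatten_lineups(sample):
    print(row)
```

If the real payload matches these assumptions, passing flatten_lineups(jsonData) to pd.DataFrame should give a single frame covering every team rather than one team at a time.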