Parsing HTML source code with AppleScript

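I'm trying to parse an HTML file that I converted to a TXT file inside Automator.

I previously downloaded the HTML files from a website using Automator, and now I'm struggling to parse the source code.

Ideally, I'd like to extract the information in the table, and I need to repeat this for 1800 different HTML files.

Here is a sample of the source code: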

  </head>
  <body>
  <div id="header">
  <div class="wrapper">
  <span class="access">
  <div id="fb-root"></div>


  <span class="access">
  Gold Account: <a class="upgrade" title="Account Details" href="http://www.hedge-professionals.com/account-details.html" >Active </a> Logged in as Edward&nbsp;&nbsp; | &nbsp;&nbsp;<a href="javascript:void(0);" onclick='logout()' class="logout">Sign Out</a>

  </span>
  </span>
  </div><!-- /wrapper -->
  </div><!-- /header -->

  <div id="masthead">
  <div class="wrapper">
  <a href="http://www.hedge-professionals.com" ><img src="http://www.hedge-professionals.com/images/hedgep_logo_white.png" alt="Hedge Professionals Database" width="333" height="46" class="logo" border="0" /></a>
  <div id="navigation">
  <ul>
  <li ><a href='http://www.hedge-professionals.com/dashboard.html' >Dashboard</a></li> <li ><a href='http://www.hedge-professionals.com/people.html'class='current' >People</a></li><li ><a href='http://www.hedge-professionals.com/watchlists.html' >My Watchlists</a></li><li ><a href='http://www.hedge-professionals.com/my-searches.html' >My Searches</a></li><li ><a href='http://www.hedge-professionals.com/my-profile.html' >My Profile</a></li></ul>
  </div><!-- /navigation -->

  </div><!-- /wrapper -->
  </div><!-- /masthead -->


  <div id="content">
  <div class="wrapper">
  <div id="main-content">

  <!-- per Project stuff -->
  <span class="section">
  <img src="http://www.hedge-professionals.com/images/people/noimage_53x53.jpg" alt="Christian Sieling" width="52" height="53" class="profile-pic" id="profile-pic-104947"/>
  <h1><span id="profile-name-104947" >Christian Sieling</span></h1>
  <ul class="gbutton-group right">
  <li><a class="gbutton bold pill" href="http://www.hedge-professionals.com/people.html">&laquo; Back </a></li>
  <li><a class="gbutton bold pill Boxy on-click" href="http://www.hedge-professionals.com/addtoWatchlist.PHP?usr=114752" id="row-104947" title='Add to Watchlist' >Add to Watchlist</a></li>
  </ul>

  <div style="float:right;padding:3px 3px;text-align:center;margin-top:5px;" >
  <span id="profile-updated-date" >Updated On: 4 Aug,2010</span><br/>
  <a class="gbutton bold pill" href="http://www.hedge-professionals.com/profile/suggest/people/104947/Christian-Sieling" style="margin:5px;" title='Report Inaccurate Data' >Report Inaccurate Data</a>
  </div>
  <h2><span id="profile-details-104947" > at <a href="http://www.hedge-professionals.com/quicksearch/search/Lumix+Capital+Management+Ltd." ><span title='Lumix Capital Management Ltd.' >Lumix Capital Management Ltd.</span></a></span><input type="hidden" name="sub-id" id="sub-id" value="114752"></h2>

  </span>

  <table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">
  <tr>
  <th>Role</th>
  <td>
  <p>Other</p> </td>
  </tr>
  <tr>
  <th>Organisation Type</th>
  <td>
  <p>Asset Manager</p> </td>
  </tr>
  <tr>
  <th>Email</th>
  <td><a href="mailto:cs@lumixcapital.com" title="cs@lumixcapital.com" >cs@lumixcapital.com</a></td>
  </tr>
  <tr>
  <th>Website</th>
  <td><a href="http://www.lumixcapital.com/" target="_new" title="http://www.lumixcapital.com/" >http://www.lumixcapital.com/</a></td>
  </tr>
  <tr>
  <th>Phone</th>
  <td>41 78 616 7334</td>
  </tr>
  <tr>
  <th>Fax</th>
  <td></td>
  </tr>
  <tr>
  <th>Mailing Address</th>
  <td>Birrenstrasse 30</td>
  </tr>
  <tr>
  <th>City</th>
  <td>Schindellegi</td>
  </tr>
  <tr>
  <th>State</th>
  <td>CH</td>
  </tr>
  <tr>
  <th>Country</th>
  <td>Switzerland</td>
  </tr>
  <tr>
  <th class="lastrow" >Zip/ Postal Code</th>
  <td class="lastrow" >8834</td>
  </tr>
  </table>
  </div><!-- /main-content -->
  <div id="sidebar" >
  </div>

  <div id="similar_sidebar" class="similar_refine" >



  </div>
  </div><!-- /wrapper -->
  </div><!-- /content -->

  <div id="footer">

  </div>

My AppleScript attempts to extract the table using text item delimiters, along these lines:

  set p to input
  set ex to extractBetween(p, "<table>", "</table>") -- extract the URL

  to extractBetween(SearchText, startText, endText)
      set tid to AppleScript's text item delimiters
      set AppleScript's text item delimiters to startText
      set endItems to text of text item -1 of SearchText
      set AppleScript's text item delimiters to endText
      set beginningToEnd to text of text item 1 of endItems
      set AppleScript's text item delimiters to tid
      return beginningToEnd
  end extractBetween

How can I parse the table out of the HTML file?

Solution

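You're really close. The problem is your startText variable. The literal opening tag <table> never appears in the HTML text, so it cannot be found. The line that starts the table is actually:

  <table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">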
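
To see the mismatch in isolation, here is a minimal sketch. The sampleHTML string is a made-up, shortened stand-in for the real page source. It counts the text items produced by each delimiter: when a delimiter does not occur, the text stays in one piece, which is why the original handler ends up returning everything from the start of the file up to </table>.

```applescript
-- Hypothetical, shortened sample standing in for the real page source
set sampleHTML to "<div><table width=\"100%\" id=\"profile-table\"><tr><td>x</td></tr></table></div>"

set tid to AppleScript's text item delimiters

-- The literal "<table>" does not occur, so the text is not split at all
set AppleScript's text item delimiters to "<table>"
set exactCount to count of text items of sampleHTML

-- The prefix "<table" matches the real opening tag, so the text splits in two
set AppleScript's text item delimiters to "<table"
set prefixCount to count of text items of sampleHTML

-- Always restore the previous delimiters
set AppleScript's text item delimiters to tid

return {exactCount, prefixCount} --> {1, 2}
```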

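So I modified your code to look for that tag in two steps. First it looks for:

  <table

and then, separately, for:

  >

This way we can ignore all the attributes that come with the table tag (width, border, etc.), since I assume they vary between files. After that, only the table's own code is left. Try this: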

  set p to input
  set ex to extractBetween(p, "<table", ">", "</table>")

  to extractBetween(SearchText, startText1, startText2, endText)
      set tid to AppleScript's text item delimiters
      -- keep everything after the last "<table"
      set AppleScript's text item delimiters to startText1
      set endItems to text item -1 of SearchText
      -- cut that off at the closing "</table>"
      set AppleScript's text item delimiters to endText
      set beginningToEnd to text item 1 of endItems
      -- drop the rest of the opening tag, up to and including its first ">"
      set AppleScript's text item delimiters to startText2
      set finalText to (text items 2 thru -1 of beginningToEnd) as text
      -- restore the original delimiters
      set AppleScript's text item delimiters to tid
      return finalText
  end extractBetween
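
Since the goal is the information inside the table, the extracted text could be broken down further with the same delimiter technique. The following is only a sketch under assumptions: stripTags and tableToPairs are hypothetical helper names that are not part of the code above, the sample string is made up, and each row is assumed to hold one <th> label followed by one <td> value, as in the sample page. Surrounding whitespace is not trimmed.

```applescript
-- Hypothetical helper: drop everything between "<" and ">" in a chunk of text
on stripTags(chunk)
    if chunk is "" then return ""
    set tid to AppleScript's text item delimiters
    set AppleScript's text item delimiters to "<"
    set parts to text items of chunk
    -- text before the first "<" is kept as-is
    set cleaned to item 1 of parts
    set AppleScript's text item delimiters to ">"
    repeat with i from 2 to count of parts
        set piece to item i of parts
        -- each piece looks like "tagname attrs>visible text"; keep what follows the ">"
        if (count of text items of piece) > 1 then
            set cleaned to cleaned & ((text items 2 thru -1 of piece) as text)
        end if
    end repeat
    set AppleScript's text item delimiters to tid
    return cleaned
end stripTags

-- Hypothetical helper: turn table rows into a list of {label, value} pairs
on tableToPairs(tableText)
    set tid to AppleScript's text item delimiters
    set AppleScript's text item delimiters to "</tr>"
    set rowChunks to text items of tableText
    set AppleScript's text item delimiters to tid
    set pairs to {}
    repeat with rowChunk in rowChunks
        set rowText to contents of rowChunk
        -- skip chunks that carry no header cell (for example the trailing remainder)
        if rowText contains "</th>" then
            set AppleScript's text item delimiters to "</th>"
            set labelPart to text item 1 of rowText
            set valuePart to text item 2 of rowText
            set AppleScript's text item delimiters to tid
            set end of pairs to {stripTags(labelPart), stripTags(valuePart)}
        end if
    end repeat
    return pairs
end tableToPairs

-- Made-up sample in the same shape as the profile table
set sampleTable to "<tr><th>Role</th><td><p>Other</p></td></tr><tr><th>Phone</th><td>41 78 616 7334</td></tr>"
tableToPairs(sampleTable)
```

In the Automator workflow, the result of extractBetween would be passed to tableToPairs in place of sampleTable. Pages whose rows are laid out differently would need a different split.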
