python – “Error while receiving a control message (SocketClosed): empty socket content” in Tor's stem controller


I'm using a scraper that runs through Tor; a simplified version of it is in this example project: https://github.com/khpeek/scraper-compose. The project has the following (simplified) structure:

```
.
├── docker-compose.yml
├── privoxy
│   ├── config
│   └── Dockerfile
├── scraper
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── tutorial
│   │   ├── scrapy.cfg
│   │   └── tutorial
│   │       ├── extensions.py
│   │       ├── __init__.py
│   │       ├── items.py
│   │       ├── middlewares.py
│   │       ├── pipelines.py
│   │       ├── settings.py
│   │       ├── spiders
│   │       │   ├── __init__.py
│   │       │   └── quotes_spider.py
│   │       └── tor_controller.py
│   └── wait-for
│       └── wait-for
└── tor
    ├── Dockerfile
    └── torrc
```

The spider, defined in quotes_spider.py, is a very simple one based on the Scrapy Tutorial:

```python
import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/{n}/'.format(n=n) for n in range(1, 3)]
    custom_settings = {
        'TOR_RENEW_IDENTITY_ENABLED': True,
        'TOR_ITEMS_TO_SCRAPE_PER_IDENTITY': 5
    }
    download_delay = 2  # Wait 2 seconds (actually a random time between 1 and 3 seconds) between downloading pages

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item
```

In settings.py, I've activated a Scrapy extension with the lines

```python
EXTENSIONS = {
    'tutorial.extensions.TorRenewIdentity': 1,
}
```

where extensions.py is

```python
import logging
import random

from scrapy import signals
from scrapy.exceptions import NotConfigured

import tutorial.tor_controller as tor_controller

logger = logging.getLogger(__name__)

class TorRenewIdentity(object):

    def __init__(self, crawler, item_count):
        self.crawler = crawler
        self.item_count = self.randomize(item_count)  # Randomize the item count to confound traffic analysis
        self._item_count = item_count                 # Also remember the given item count for future randomizations
        self.items_scraped = 0

        # Connect the extension object to signals
        self.crawler.signals.connect(self.item_scraped, signal=signals.item_scraped)

    @staticmethod
    def randomize(item_count, min_factor=0.5, max_factor=1.5):
        '''Randomize the number of items scraped before changing identity. (A similar technique is applied to Scrapy's DOWNLOAD_DELAY setting).'''
        randomized_item_count = random.randint(int(min_factor*item_count), int(max_factor*item_count))
        logger.info("The crawler will scrape the following (randomized) number of items before changing identity (again): {}".format(randomized_item_count))
        return randomized_item_count

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('TOR_RENEW_IDENTITY_ENABLED'):
            raise NotConfigured

        item_count = crawler.settings.getint('TOR_ITEMS_TO_SCRAPE_PER_IDENTITY', 50)

        return cls(crawler=crawler, item_count=item_count)  # Instantiate the extension object

    def item_scraped(self, item, spider):
        '''When item_count items are scraped, pause the engine and change IP address.'''
        self.items_scraped += 1
        if self.items_scraped == self.item_count:
            logger.info("Scraped {item_count} items. Pausing engine while changing identity...".format(item_count=self.item_count))

            self.crawler.engine.pause()
            tor_controller.change_identity()  # Change IP address (cf. https://stem.torproject.org/faq.html#how-do-i-request-a-new-identity-from-tor)
            self.items_scraped = 0            # Reset the counter
            self.item_count = self.randomize(self._item_count)  # Generate a new random number of items to scrape before changing identity again

            self.crawler.engine.unpause()
```
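As a quick sanity check of the randomization technique used above, here is a standalone sketch (not part of the project) showing that with the default factors of 0.5 and 1.5 an item count of 50 always lands between 25 and 75 inclusive:

```python
import random

def randomize(item_count, min_factor=0.5, max_factor=1.5):
    """Pick a random item count in [min_factor*item_count, max_factor*item_count],
    mirroring the extension's randomize() above."""
    return random.randint(int(min_factor * item_count), int(max_factor * item_count))

# With item_count=50, every sample falls in the closed interval [25, 75].
samples = [randomize(50) for _ in range(1000)]
print(min(samples), max(samples))
```

Because `random.randint` is inclusive on both ends, the identity is renewed at an unpredictable but bounded interval, which is the point of the traffic-analysis countermeasure.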

and tor_controller.py is

```python
import logging
import sys
import socket
import time

import requests
import stem
import stem.control

# Tor settings
TOR_ADDRESS = socket.gethostbyname("tor")  # The Docker-Compose service in which this code is running should be linked to the "tor" service.
TOR_CONTROL_PORT = 9051  # This is configured in /etc/tor/torrc by the line "ControlPort 9051" (or by launching Tor with "tor -controlport 9051")
TOR_PASSWORD = "foo"     # The Tor password is written in the docker-compose.yml file. (It is passed as a build argument to the 'tor' service).

# Privoxy settings
PRIVOXY_ADDRESS = "privoxy"  # This assumes this code is running in a Docker-Compose service linked to the "privoxy" service
PRIVOXY_PORT = 8118          # This is determined by the "listen-address" in Privoxy's "config" file
HTTP_PROXY = 'http://{address}:{port}'.format(address=PRIVOXY_ADDRESS, port=PRIVOXY_PORT)

logger = logging.getLogger(__name__)

class TorController(object):
    def __init__(self):
        self.controller = stem.control.Controller.from_port(address=TOR_ADDRESS, port=TOR_CONTROL_PORT)
        self.controller.authenticate(password=TOR_PASSWORD)
        self.session = requests.Session()
        self.session.proxies = {'http': HTTP_PROXY}

    def request_ip_change(self):
        self.controller.signal(stem.Signal.NEWNYM)

    def get_ip(self):
        '''Check what the current IP address is (as seen by IPEcho).'''
        return self.session.get('http://ipecho.net/plain').text

    def change_ip(self):
        '''Signal a change of IP address and wait for confirmation from IPEcho.net'''
        current_ip = self.get_ip()
        logger.debug("Initializing change of identity from the current IP address, {current_ip}".format(current_ip=current_ip))
        self.request_ip_change()
        while True:
            new_ip = self.get_ip()
            if new_ip == current_ip:
                logger.debug("The IP address is still the same. Waiting for 1 second before checking again...")
                time.sleep(1)
            else:
                break
        logger.debug("The IP address has been changed from {old_ip} to {new_ip}".format(old_ip=current_ip, new_ip=new_ip))
        return new_ip

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.controller.close()

def change_identity():
    with TorController() as tor_controller:
        tor_controller.change_ip()
```
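One detail worth noting about the `change_ip` method above: its `while True` loop will spin forever if Tor keeps handing out the same exit IP. The polling can be factored into a small, reusable helper with a timeout. This is a sketch, not part of the project above, and `wait_for_change` is a hypothetical name:

```python
import time

def wait_for_change(get_value, old_value, poll_interval=1.0, timeout=60.0):
    """Poll get_value() until it differs from old_value or the timeout expires.

    Same pattern as the while-loop in TorController.change_ip, plus an
    upper bound on waiting. Returns the new value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        new_value = get_value()
        if new_value != old_value:
            return new_value
        time.sleep(poll_interval)
    raise TimeoutError("value did not change within {:.0f}s".format(timeout))
```

Inside `change_ip` this would be used as `new_ip = wait_for_change(self.get_ip, current_ip)`, so a stuck identity change surfaces as an exception instead of a hung crawl.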

If I build with docker-compose build and then start crawling with docker-compose up, by and large the extension works: according to the logs, it successfully changes IP addresses and keeps scraping.

What irks me, however, are the error messages I see while the engine is paused, such as

```
scraper_1    | 2017-05-12 16:35:06 [stem] INFO: Error while receiving a control message (SocketClosed): empty socket content
```

followed by

```
scraper_1    | 2017-05-12 16:35:06 [stem] INFO: Error while receiving a control message (SocketClosed): received exception "peek of closed file"
```

What is causing these errors? Since they have INFO level, can I perhaps just ignore them? (I've looked at Stem's source code at https://gitweb.torproject.org/stem.git/ but so far haven't been able to get a handle on what is happening.)

Best answer
I don't know whether you ever reached a conclusion on your question.

I actually got the same log message as you. My Scrapy project performed well, and the IP rotation using Tor and Privoxy also worked. I just kept getting the log line INFO: [stem] Error while receiving a control message (SocketClosed): empty socket content, which bugged me.

I spent some time digging into what causes it and whether I could safely ignore it (after all, it is only an info message, not an error message).

The bottom line is that I don't know what causes it, but I concluded it is safe to ignore.

As the log says, the socket content (actually the stem control_file, which holds the relevant information about the socket connection) is empty, and when the control_file is empty, the socket connection is closed (per the Python socket documentation). I'm not sure what causes the control_file to become empty and the socket to close. However, even if the socket connection really does close, it is apparently reopened successfully, since my Scrapy crawl jobs and IP rotation kept working. Although I couldn't find the real cause, I can only guess at a few possibilities: (1) the Tor network is unstable, or (2) the socket closes momentarily when your code runs controller.signal(Signal.NEWNYM) and is opened again afterwards, or some other reason I can't currently think of.
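If, as argued above, these messages are benign, one way to keep them out of the crawl log is to raise the threshold of Stem's logger using the standard logging module. Stem emits the SocketClosed notices at INFO level under the 'stem' logger, so this line (placed e.g. near the top of settings.py or extensions.py; the placement is up to you) hides them without touching Scrapy's own logging:

```python
import logging

# Stem logs the SocketClosed notices at INFO level under the 'stem' logger,
# so raising its threshold to WARNING suppresses them while still letting
# genuine warnings and errors through.
logging.getLogger('stem').setLevel(logging.WARNING)
```

This filters only messages from Stem; any other logger in the project is unaffected.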
