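How do I supply URLs to StormCrawler from a text file?

I have a large number of URLs (about 40,000) that need to be crawled with StormCrawler. Is there a way to pass these URLs as a text file instead of listing them in crawler.flux? Something like this: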

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.MemorySpout"
    parallelism: 1
    constructorArgs:
      - "URLs.txt"
Answers:

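For Solr and Elasticsearch there are injectors that read URLs from a file and add them to the status index as DISCOVERED items. This of course requires that you use Solr or Elasticsearch to hold the status index. The injector is launched as a topology, for example: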

storm ... com.digitalpebble.stormcrawler.elasticsearch.ESSeedInjector .../seeds '*' -conf ...

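There is a FileSpout for exactly this purpose. It is used by the topologies that @sebastian-nagel mentioned, and you can also use it in your own topologies (the answer links to an example topology).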


Based on @Julien Nioche's answer, I wrote a crawler.flux that meets my requirements. Here is the file:

name: "crawler"

includes:
    - resource: true
      file: "/crawler-default.yaml"
      override: false

    - resource: false
      file: "crawler-conf.yaml"
      override: true

    - resource: false
      file: "solr-conf.yaml"
      override: true



spouts:

  - id: "spout"
    className: "com.digitalpebble.stormcrawler.solr.persistence.SolrSpout"
    parallelism: 1

  - id: "filespout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."
      - "seeds"
      - true

bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 1
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 1
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 5
  - id: "index"
    className: "com.digitalpebble.stormcrawler.solr.bolt.IndexerBolt"
    parallelism: 1
  - id: "status"
    className: "com.digitalpebble.stormcrawler.solr.persistence.StatusUpdaterBolt"
    parallelism: 1

streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE

  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]

  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "parse"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "index"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"


  - from: "filespout"
    to: "status"
    grouping:
      streamId: "status"
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byDomain"

Instead of "." you can set the directory that contains your URL file, and instead of "seeds" you can enter the name of the URL file.
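
As a rough illustration of that substitution, the filespout entry might look like the fragment below. The directory /data/crawl is a hypothetical path chosen for this sketch, and URLs.txt is the file name from the question; the fragment also assumes the file holds one URL per line and that the third argument keeps the same meaning as in the file above.

```yaml
spouts:
  - id: "filespout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "/data/crawl"   # hypothetical directory containing the URL file (replaces ".")
      - "URLs.txt"      # name of the URL file (replaces "seeds"); assumed one URL per line
      - true            # unchanged from the crawler.flux above
```

The rest of the topology, including the stream from "filespout" to "status", would stay as in the crawler.flux above.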

