Is there an automated way to extract URLs through the Google Search API without getting IP-banned?

2024-05-08 • Q&A

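I am currently working on a project that uses the Google search API to find the most relevant website for each entry in a list of words. The list is an imported Excel sheet containing 20-100 company names; for each company name I request the most relevant Google search result and extract its URL. I have been doing this almost daily for about 2 weeks, each time with a different list of companies.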

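Since I'm doing this for every word in the list, Google banned my IP after 1 week of use. A day later I tried again with longer, random wait times between requests, but I was blocked again after only 5 requests. I use random wait times between 30 and 60 seconds to mimic human behaviour.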

I am using the library googlesearch.search to perform the Google searches.

import random
import time
import urllib.error

from googlesearch import search

def find_website(names):
    # Empty list to which the found URLs will be added
    links = []
    # Iterate over the rows of the dataframe; each row has a `name` column
    for i in names.itertuples():
        # Random wait time, passed to search() as the pause between its HTTP requests
        search_time = random.uniform(30, 60)
        # Retry until the search succeeds, appending the most relevant URL to the list
        while True:
            try:
                for j in search(i.name, lang='en', tld='co.in', num=5, stop=1, pause=search_time):
                    links.append(j)
                    time.sleep(1)
            except urllib.error.HTTPError as err:
                print(err.code)
                print(err.headers)
                print(err.read())
                time.sleep(1)
                continue
            break
    return links
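One thing that stands out in the code above: after an HTTPError the loop retries after only one second, so once a block starts, the script keeps sending requests at a fast rate. A minimal sketch of exponential backoff with jitter is shown below. The helper names (`backoff_delays`, `call_with_backoff`) and the default values are illustrative assumptions, not part of the googlesearch library, and backing off is not guaranteed to prevent a ban.

```python
import random
import time


def backoff_delays(base=30.0, factor=2.0, max_delay=3600.0, retries=5):
    """Yield one wait time in seconds per retry, growing exponentially.

    Hypothetical helper: the defaults are illustrative, not tuned values.
    """
    for attempt in range(retries):
        ceiling = min(max_delay, base * factor ** attempt)
        # Jitter: pick a random delay between half the ceiling and the ceiling
        yield random.uniform(ceiling / 2, ceiling)


def call_with_backoff(func, is_retryable, delays, sleep=time.sleep):
    """Call func(); on a retryable exception, wait for the next delay and retry."""
    for delay in delays:
        try:
            return func()
        except Exception as err:
            if not is_retryable(err):
                raise
            sleep(delay)
    # Delays exhausted: make one final attempt and let any exception propagate
    return func()
```

In the loop above, this could wrap a single lookup, with `is_retryable` checking for `urllib.error.HTTPError`, so that repeated failures slow the script down instead of keeping the request rate constant.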

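I know they use advanced algorithms to detect automated scripts, but is there an alternative, such as using proxies, ...? Or could I use the official paid version of Google's search API for this, or Microsoft's Bing API?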

Or is there something I can change in my code, or another library I can use, to prevent further bans?
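For reference, the official API option mentioned above would look roughly like the following. This is a minimal sketch, assuming Google's Custom Search JSON API with an API key and a Programmable Search Engine ID; `api_key`, `engine_id` and the function names are placeholders introduced for illustration. The API is subject to a request quota, and its results come from the configured search engine, so they may differ from what google.com shows.

```python
import json
import urllib.parse
import urllib.request

# Endpoint of Google's Custom Search JSON API
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key, engine_id, num=1):
    """Build the request URL; api_key and engine_id are placeholders you must supply."""
    params = {"key": api_key, "cx": engine_id, "q": query, "num": num}
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)


def first_link(response):
    """Return the URL of the first result in a parsed JSON response, or None."""
    items = response.get("items", [])
    return items[0]["link"] if items else None


def find_website(name, api_key, engine_id):
    """Look up one company name and return the top result's URL, or None."""
    url = build_search_url(name, api_key, engine_id)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return first_link(json.load(resp))
```

Because requests are authenticated with a key rather than scraped from the result pages, this route is tied to the key's quota instead of depending on the caller's IP address.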


zhaiguanghusha answered: Is there an automated way to extract URLs through the Google Search API without getting IP-banned?
