Bash shell script to find the robots meta tag value

2024-04-30 • Q&A

I found this bash script, which checks the status of URLs listed in a text file and prints the destination URL when a redirect occurs:

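#!/bin/bash
# Pass the file containing the URLs (one per line) as the first argument.
while IFS= read -r url
do
    dt=$(date '+%H:%M:%S')
    # HEAD request only: prints the status code and, for redirects, the target URL
    urlstatus=$(curl -kH 'Cache-Control: no-cache' -o /dev/null --silent --head --write-out '%{http_code} %{redirect_url}' "$url")
    echo "$url $urlstatus $dt" >> urlstatus.txt
done < "$1"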

I'm not very comfortable with bash: for every URL, I would like to add the value of its robots meta tag (if it exists).

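Honestly, I would really recommend using a DOM parser (e.g. Nokogiri, hxselect, etc.), but you can do it like this (i.e. process the lines containing <meta and "extract" the value of the content attribute of the robots tag):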

curl -s "$url" | sed -n '/<meta/s/.*<meta[[:space:]][[:space:]]*name="*robots"*[[:space:]][[:space:]]*content="*\([^">]*\)"*[[:space:]]*\/\{0,1\}>.*/\1/p'

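This prints the attribute value, or nothing at all if the tag is not present. It only matches tags written on a single line with name before content.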

Do you need a pure Bash solution, or do you have sed available?

You can add a line that extracts the robots meta tag from the page's source code, and modify the echo line to show its value:

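    metarobotsheader=$(curl -kH 'Cache-Control: no-cache' --silent "$url" | grep -P -i "<meta.+robots" )
    echo "$url $urlstatus $dt $metarobotsheader" >> urlstatus.txt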

This example records the raw source line that contains the robots meta tag.

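If you want to record a "-" marker when the page has no robots meta tag, you can change the metarobotsheader line so that it falls back to "-" whenever grep finds no match.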
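One way the "-" fallback could look is sketched below. The helper name robots_line_or_dash is hypothetical and not part of the original answer, and it uses grep -E with a slightly stricter pattern than the grep -P call in the script. It relies on the fact that grep prints nothing when no line matches, so an empty result can be swapped for "-" with a shell default expansion.

```shell
#!/bin/bash
# Hypothetical helper (not from the original answer): reads HTML on stdin and
# prints the first line containing a robots meta tag, or "-" if none is found.
robots_line_or_dash() {
    local line
    # -i: case-insensitive match; -m 1: stop after the first matching line
    line=$(grep -i -m 1 -E '<meta[^>]+robots')
    # grep prints nothing when there is no match, so fall back to "-"
    printf '%s\n' "${line:--}"
}

# Possible use inside the loop (needs network access, so left commented out):
# metarobotsheader=$(curl -kH 'Cache-Control: no-cache' --silent "$url" | robots_line_or_dash)
```

Keeping the fallback in a small function leaves the curl call in the loop unchanged apart from the extra pipe stage.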

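If you want to extract the exact value of the attribute, you can change that line. For reference, here is the complete script with the metarobotsheader line in its basic form, which records the whole matching line: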

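#!/bin/bash
# Pass the file containing the URLs (one per line) as the first argument.
while IFS= read -r url
do
    dt=$(date '+%H:%M:%S')
    # HEAD request only: prints the status code and, for redirects, the target URL
    urlstatus=$(curl -kH 'Cache-Control: no-cache' -o /dev/null --silent --head --write-out '%{http_code} %{redirect_url}' "$url")
    # Full GET: keep the source line(s) that contain a robots meta tag (grep -P needs GNU grep)
    metarobotsheader=$(curl -kH 'Cache-Control: no-cache' --silent "$url" | grep -P -i "<meta.+robots")
    echo "$url $urlstatus $dt $metarobotsheader" >> urlstatus.txt
done < "$1"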

In the exact-value variant, a URL whose page contains no robots meta tag is reported as no_meta_robots.
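A minimal sketch of such an exact-value variant follows. It is not the original answer's code. The helper name robots_value is hypothetical, and the sketch assumes the tag sits on one line, that name comes before content, and that the content value is double-quoted.

```shell
#!/bin/bash
# Hypothetical helper (not from the original answer): reads HTML on stdin and
# prints the content value of the first robots meta tag, or "no_meta_robots"
# when no such tag (or no double-quoted content attribute) is found.
robots_value() {
    local value
    # grep -o isolates the tag itself; head keeps only the first one;
    # sed then pulls out whatever sits between content=" and the next quote.
    value=$(grep -i -o -E '<meta[^>]*name="?robots"?[^>]*>' | head -n 1 |
        sed -n 's/.*[Cc][Oo][Nn][Tt][Ee][Nn][Tt]="\([^"]*\)".*/\1/p')
    # Empty result means nothing was extracted, so print the marker instead
    printf '%s\n' "${value:-no_meta_robots}"
}

# Possible use inside the loop (needs network access, so left commented out):
# metarobotsheader=$(curl -kH 'Cache-Control: no-cache' --silent "$url" | robots_value)
```

A tag with an empty content attribute is also reported as no_meta_robots here. For pages with unusual markup, a DOM parser remains the more reliable option.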
