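I have successfully set up a Glue crawler in the AWS console.
Now I have a CloudFormation template that reproduces the whole setup, except that I cannot get the Exclusions: field to take effect from the template. Background: according to the AWS Glue API, the Exclusions: field is a list of glob patterns used to exclude files or folders in the data store (an S3 data store in my case) that match a given pattern.
Every other value in the template shows up in the crawler configuration, e.g. the S3 target, the crawler name, the IAM role, and the grouping behavior. Despite considerable effort, though, I cannot get the glob patterns to appear in the Glue console: every field is populated from the CFN template except Exclusions (shown as "Exclude patterns" in the Glue console). My CFN template passes validation, and I ran the crawler hoping the hidden exclusion globs would still have some effect, but they do not. How can I populate the Exclusions field?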
Here's the S3Target Exclusion AWS Glue API guide
Here's an AWS sample YAML CFN for a Glue Crawler
Here's a helpful YAML string array guide
YAML
CFNCrawlerSecDeraNUM:
  Type: AWS::Glue::Crawler
  Properties:
    Name: !Ref CFNCrawlerName
    Role: !GetAtt CFNRoleSecDERA.Arn
    #Classifiers: none, use the default classifier
    Description: AWS Glue crawler to crawl SecDERA data
    #Schedule: none, use default run-on-demand
    DatabaseName: !Ref CFNDatabaseName
    Targets:
      S3Targets:
        - Exclusions:
            - "*/readme.htm"
            - "*/sub.txt"
            - "*/pre.txt"
            - "*/tag.txt"
        - Path: "s3://sec-input"
    TablePrefix: !Ref CFNTablePrefixName
    SchemaChangePolicy:
      UpdateBehavior: "UPDATE_IN_DATABASE"
      DeleteBehavior: "LOG"
    # Added single schema grouping Glue API option
    Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}},\"Grouping\":{\"TableGroupingPolicy\":\"CombineCompatibleSchemas\"}}"
JSON
"CFNCrawlerSecDeraNUM": {
  "Type": "AWS::Glue::Crawler",
  "Properties": {
    "Name": {
      "Ref": "CFNCrawlerName"
    },
    "Role": {
      "Fn::GetAtt": [
        "CFNRoleSecDERA",
        "Arn"
      ]
    },
    "Description": "AWS Glue crawler to crawl SecDERA data",
    "DatabaseName": {
      "Ref": "CFNDatabaseName"
    },
    "Targets": {
      "S3Targets": [
        {
          "Exclusions": [
            "*/readme.htm",
            "*/sub.txt",
            "*/pre.txt",
            "*/tag.txt"
          ]
        },
        {
          "Path": "s3://sec-input"
        }
      ]
    },
    "TablePrefix": {
      "Ref": "CFNTablePrefixName"
    },
    "SchemaChangePolicy": {
      "UpdateBehavior": "UPDATE_IN_DATABASE",
      "DeleteBehavior": "LOG"
    },
    "Configuration": "{\"Version\":1.0,\"Grouping\":{\"TableGroupingPolicy\":\"CombineCompatibleSchemas\"}}"
  }
}
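
One thing worth checking, offered as a sketch rather than a confirmed fix: in both templates the `Exclusions` list and the `Path` are separate entries of `S3Targets` (two `-` items in the YAML, two `{...}` objects in the JSON). That describes one target with exclusions but no path, and a second target with a path but no exclusions. In the CloudFormation `S3Target` property type, `Path` and `Exclusions` are keys of the same object, so the fragment below puts them in a single entry. The bucket name and patterns are copied from the templates above; nothing else is assumed.

```yaml
# Sketch only: Path and Exclusions as keys of the SAME S3Targets entry.
# Bucket name and glob patterns are taken unchanged from the question.
Targets:
  S3Targets:
    - Path: "s3://sec-input"
      Exclusions:
        - "*/readme.htm"
        - "*/sub.txt"
        - "*/pre.txt"
        - "*/tag.txt"
```

The JSON equivalent would be a single object in the `S3Targets` array containing both `"Path"` and `"Exclusions"`. If the patterns still do not match after this change, it may be worth confirming how the globs are resolved against the include path, since that behavior is not shown in the templates here.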