To prevent redundant data, you must use the `CLEANPATH` option in your Redshift `UNLOAD` statement. Note the distinction drawn in the documentation (maybe AWS could make this clearer):
ALLOWOVERWRITE
By default, UNLOAD fails if it finds files that it would possibly overwrite. If ALLOWOVERWRITE is specified, UNLOAD overwrites existing files, including the manifest file.
CLEANPATH
The CLEANPATH option removes existing files located in the Amazon S3 path specified in the TO clause before unloading files to the specified location.
If you include the PARTITION BY clause, existing files are removed only from the partition folders to receive new files generated by the UNLOAD operation.
You must have the s3:DeleteObject permission on the Amazon S3 bucket. For information, see Policies and Permissions in Amazon S3 in the Amazon Simple Storage Service Console User Guide. Files that you remove by using the `CLEANPATH` option are permanently deleted and can't be recovered.
You can't specify the `CLEANPATH` option if you specify the `ALLOWOVERWRITE` option.
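So, as @Vzzarr said, `ALLOWOVERWRITE` only overwrites files whose names match those of the incoming files. For repeated unloads where the previous state of the data does not need to be preserved, you must use `CLEANPATH`.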
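The difference can be illustrated with a toy model that treats an S3 prefix as a set of object keys. This is only a sketch: the file names are made up, and the two functions are hypothetical stand-ins for the behavior the documentation describes, not Redshift code.

```python
from __future__ import annotations


def unload_allowoverwrite(existing: set[str], new_files: set[str]) -> set[str]:
    """Model ALLOWOVERWRITE: same-named files are replaced, everything else stays."""
    return existing | new_files


def unload_cleanpath(existing: set[str], new_files: set[str]) -> set[str]:
    """Model CLEANPATH: the prefix is emptied before the new files are written."""
    return set(new_files)


# Hypothetical keys: the previous run wrote three files, the new run writes two.
previous_run = {"0000_part_00.parquet", "0001_part_00.parquet", "0002_part_00.parquet"}
current_run = {"0000_part_00.parquet", "0001_part_00.parquet"}

stale = unload_allowoverwrite(previous_run, current_run) - current_run
print(sorted(stale))  # ['0002_part_00.parquet']: left over from the previous run
print(sorted(unload_cleanpath(previous_run, current_run) - current_run))  # []
```

In this model, any reader that scans the whole prefix after an `ALLOWOVERWRITE` run would also pick up the leftover file, which is the redundant data the answer warns about.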
Note that you can't use both `ALLOWOVERWRITE` and `CLEANPATH` in the same `UNLOAD` statement.
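Here is an example:
```python
# your_query, destination_prefix, and IAM_ROLE_ARN are supplied by the caller.
unload_sql = f"""
UNLOAD ('{your_query}')
TO 's3://{destination_prefix}/'
IAM_ROLE '{IAM_ROLE_ARN}'
PARQUET
MAXFILESIZE 4 GB
MANIFEST VERBOSE
CLEANPATH
"""
```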
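If the statement is assembled in Python, the mutual exclusion between the two options can be checked before anything is sent to Redshift. The following is a minimal sketch, assuming a hypothetical helper named `build_unload` (it is not part of any library); the query, bucket, and role ARN in the usage lines are placeholders.

```python
from __future__ import annotations


def build_unload(query: str, destination_prefix: str, iam_role_arn: str,
                 cleanpath: bool = True, allowoverwrite: bool = False) -> str:
    """Return an UNLOAD statement; reject CLEANPATH together with ALLOWOVERWRITE."""
    if cleanpath and allowoverwrite:
        raise ValueError("CLEANPATH and ALLOWOVERWRITE cannot be combined")
    # The query sits inside a single-quoted literal, so double any single quotes.
    escaped_query = query.replace("'", "''")
    lines = [
        f"UNLOAD ('{escaped_query}')",
        f"TO 's3://{destination_prefix}/'",
        f"IAM_ROLE '{iam_role_arn}'",
        "PARQUET",
        "MAXFILESIZE 4 GB",
        "MANIFEST VERBOSE",
    ]
    if cleanpath:
        lines.append("CLEANPATH")
    if allowoverwrite:
        lines.append("ALLOWOVERWRITE")
    return "\n".join(lines)


# Placeholder values for illustration only.
print(build_unload(
    "SELECT * FROM sales WHERE region = 'EU'",
    "my-bucket/exports/sales",
    "arn:aws:iam::123456789012:role/MyRedshiftRole",
))
```

Failing early in Python gives a clearer error than waiting for Redshift to reject the statement.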
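In my experience, the `ALLOWOVERWRITE` parameter works purely on the generated file names: results are overwritten only when two files have the same name.
This parameter works in most cases, but in this domain "most cases" is not good enough. I have stopped using it since then (I was disappointed). Instead, I manually delete the files from the S3 console (or, in practice, move them to a staging folder) and then unload the data without relying on the `ALLOWOVERWRITE` parameter.
This is also mentioned in the comments on this answer: https://stackoverflow.com/a/61594603/4725074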
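The manual "move to a staging folder" step could also be scripted. The sketch below is one possible approach, not the answer author's method: `move_prefix_to_staging` is a hypothetical helper, and it assumes the client passed in behaves like `boto3.client("s3")` (it uses only `get_paginator("list_objects_v2")`, `copy_object`, and `delete_object`). A single `copy_object` call handles objects up to 5 GB, so larger files would need a different copy method.

```python
from __future__ import annotations


def move_prefix_to_staging(s3_client, bucket: str, prefix: str,
                           staging_prefix: str) -> list[str]:
    """Copy every object under `prefix` to `staging_prefix`, then delete the original.

    Returns the list of new (staging) keys.
    """
    if staging_prefix.startswith(prefix):
        raise ValueError("staging_prefix must not live under prefix")

    # List everything first so the listing is not affected by the moves below.
    paginator = s3_client.get_paginator("list_objects_v2")
    source_keys = [
        obj["Key"]
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
        for obj in page.get("Contents", [])
    ]

    moved = []
    for source_key in source_keys:
        target_key = staging_prefix + source_key[len(prefix):]
        s3_client.copy_object(
            Bucket=bucket,
            Key=target_key,
            CopySource={"Bucket": bucket, "Key": source_key},
        )
        # Delete only after the copy call has returned without raising.
        s3_client.delete_object(Bucket=bucket, Key=source_key)
        moved.append(target_key)
    return moved


# Usage (requires boto3 and AWS credentials; bucket and prefixes are placeholders):
# import boto3
# move_prefix_to_staging(boto3.client("s3"), "my-bucket", "exports/sales/", "staging/sales/")
```

Moving rather than deleting keeps the previous unload recoverable, unlike `CLEANPATH`, whose deletions are permanent.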