雅典娜（Athena）中的JSON数组读取不正确，无法取消嵌套

2024-05-20 • 问答

我有一个名为uf的列，其中包含JSON对象数组。这是一个样机：

[
    {"type": "browserId","name": "","value": "unknown"},{"type": "campaign","value": "om_227dec0082a5"},{"type": "custom","name": "2351350529","value": "10148"},"name": "9501713387","value": "true"},"name": "9517735577","name": "9507402548","name": "9733902068","value": "true"}
]

我正试图将其作为子记录，但是由于某些原因，我找不到首先取消嵌套的正确方法。然后我注意到我的整个数组都包裹在另一个JSON对象中。

我在这里：

我尝试了简单选择，并注意到结果是：

[{type=[{"type": "browserId","value": "ff"},name=null,value=null}]

此列的定义如下：

{
    "Name": "uf","Type": "array<struct<type:string,name:string,value:string>>"
}

定义不正确吗，这就是为什么我将整个数组包装在另一个json数组中？

-编辑这是我的csv文件（制表符分隔）的示例。最近两天花了很多时间来查看是否与JSON有关，这使Glue无法将列识别为JSON数组，但是我创建了一个带有简单JSON数组的新列，该列已正确分配为array<struct，但在查询后我得到的正是与上述相同的问题

timestamp   project_id  campaign_id experiment_id   variation_id    layer_holdback  audience_names  end_user_id uuid    session_id  snippet_revision    user_ip user_agent  user_engine user_engine_version referer global_holdback event_type  event_name  uf  active_views    event_features  event_metrics   event_uuid
1570326511  74971132    11089500404 11097730080 11078120202 false   []  oeu1535997971348r0.4399811351004357     AUTO    6540    5.91.170.0  Mozilla/5.0 (Linux; Android 7.0; SAMSUNG SM-G925F Build/NRD90M) AppleWebKit/537.36 (KHTML,like Gecko) SamsungBrowser/9.2 Chrome/67.0.3396.87 Mobile Safari/537.36  js  0.128.0 https://www.zavamed.com/uk/account/ false   view_activated  10832783364 [{"type": "browserId",{"type": "device","value": "mobile"},{"type": "device_type","value": "phone"},{"type": "referrer","value": "https:\/\/www.google.co.uk\/"},{"type": "source_type","value": "campaign"},{"type": "currentTimestamp","value": "-1631518596"},{"type": "offset","value": "-60"}]    []  []  []  4926a5f1-bbb5-4553-9d0b-b26f773fa0f4

dhtz126 回答：雅典娜（Athena）中的JSON数组读取不正确，无法取消嵌套

我将带有您提供的内容的示例csv文件上传到S3。然后我在上面跑了一个胶履带。这是我最终得到的表模式：

CREATE EXTERNAL TABLE `question_58765672`(
  `timestamp` bigint,`project_id` bigint,`campaign_id` bigint,`experiment_id` bigint,`variation_id` bigint,`layer_holdback` boolean,`audience_names` array<string>,`end_user_id` string,`uuid` string,`session_id` string,`snippet_revision` bigint,`user_ip` string,`user_agent` string,`user_engine` string,`user_engine_version` string,`referer` string,`global_holdback` boolean,`event_type` string,`event_name` bigint,`uf` string,`active_views` array<string>,`event_features` array<string>,`event_metrics` array<string>,`event_uuid` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\t' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://__S3_PATH_IN_MY_BUCKET__/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0','CrawlerSchemaSerializerVersion'='1.0','UPDATED_BY_CRAWLER'='some-crawler','areColumnsQuoted'='false','averageRecordSize'='553','classification'='csv','columnsOrdered'='true','compressionType'='none','delimiter'='\t','objectCount'='1','recordCount'='2','sizeKey'='1109','skip.header.line.count'='1','typeOfData'='file')

您可以看到它将列uf标识为字符串，对此我并不感到意外。为了取消嵌套此列，我必须手动将其强制转换为正确的类型ARRAY(JSON)：

SELECT
    "timestamp",_unnested_column
FROM
    "stackoverflow"."question_58765672",UNNEST( CAST(json_parse(uf) AS ARRAY(JSON)) ) AS t(_unnested_column)

结果：

    timestamp   _unnested_column
1   1570326511  {"name":"","type":"browserId","value":"unknown"}
2   1570326511  {"name":"","type":"device","value":"mobile"}
3   1570326511  {"name":"","type":"device_type","value":"phone"}
4   1570326511  {"name":"","type":"referrer","value":"https://www.google.co.uk/"}
5   1570326511  {"name":"","type":"source_type","value":"campaign"}
6   1570326511  {"name":"","type":"currentTimestamp","value":"-1631518596"}
7   1570326511  {"name":"","type":"offset","value":"-60"}

然后，我想到了创建一个athena views的地方，其中列uf会被正确地投射：

CREATE OR REPLACE VIEW question_58765672_v1_json AS
SELECT
    CAST(json_parse(uf) AS ARRAY(JSON)) as uf
    -- ALL other columns from your table
FROM 
    "stackoverflow"."question_58765672"

但是，出现以下错误：

列uf的列类型无效：不受支持的配置单元类型：json

我的猜测是，列uf的模式对于胶粘剂抓取器来说太过复杂以至于无法正确识别它，或者根本不被所使用的Serde支持，即'org.apache.hadoop.mapred.TextInputFormat'或{{1 }}。

amazon-athena aws-glue aws-glue-data-catalog hive presto

本文链接：https://www.f2er.com/3138332.html

雅典娜（Athena）中的JSON数组读取不正确，无法取消嵌套

dhtz126 回答：雅典娜（Athena）中的JSON数组读取不正确，无法取消嵌套

大家都在问