ClickHouse: topK by uniq or sum of another column

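We store sessions in ClickHouse. A row has, among others, a city, a duration, an IP, and an agent column. In one aggregation we group by page and compute the sum of duration and the uniq count over IP and agent. In addition, we aggregate the top 5 cities. However, the cities are ranked by their number of occurrences in the database before the top five are picked. Is it possible to rank the cities by unique visitors (identified by the agent/IP combination) or by the sum of duration instead?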

Edit (adding the specific query and more explanation):

          SELECT
              page,
              day,
              CAST(uniqExact(ip, agent) AS UInt16) AS uniqs,
              topKIf(5)(city, city <> '') AS top_cities,
              sum(duration) AS total_duration
          FROM pageviews
          WHERE day = toDate('2019-12-24')
          GROUP BY page

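What ends up in top_cities is determined by the number of page views for a given city. I would like top_cities to be determined by sum(duration) per city, or by the number of unique ip/agent combinations per city.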

I know I could GROUP BY page, city, ip, agent and do the final aggregation in a second step, but that would take a very long time on this dataset.

code8945 answered: ClickHouse: topK by uniq or sum of another column

topK returns an array of the (approximately) most frequent values. It ranks by occurrence count only, so it does not help here.
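To illustrate the difference between ranking by frequency and ranking by a summed column, here is a minimal plain-Python sketch (not ClickHouse code). The rows, city names, and variable names are made up for illustration and are not taken from the question's data.

```python
from collections import Counter, defaultdict

# Hypothetical (city, duration) rows for a single page.
rows = [("A", 1), ("A", 1), ("A", 1), ("B", 50), ("C", 20), ("C", 20)]

# Frequency ranking: the kind of ordering a topK-style aggregate produces.
by_count = [city for city, _ in Counter(city for city, _ in rows).most_common(2)]

# Weighted ranking: sum the duration per city, then sort by that sum.
totals = defaultdict(int)
for city, duration in rows:
    totals[city] += duration
by_duration = sorted(totals, key=totals.get, reverse=True)[:2]

print(by_count)     # ['A', 'C']
print(by_duration)  # ['B', 'C']
```

City A appears most often but contributes the least duration, so the two rankings disagree.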

It looks like you need a straightforward approach like this:

SELECT
    page,
    groupArray((city, metric)) AS cityMetricArray,
    /* Assign each City the numeric unique ID.
       If your dataset contains CityId then use it instead of this artificial key. */
    arrayMap((x, id) -> (x.1, x.2, id), cityMetricArray, arrayEnumerateDense(arrayMap(x -> (x.1), cityMetricArray))) AS cityMetricCityIdArray,
    /* Calculate the sum of metrics for each city.
       Unfortunately sumMap-function accepted only numeric array as key-array, otherwise, passing an array with city names as keys would make code more simple. */
    arrayReduce('sumMap', [arrayMap(x -> x.3, cityMetricCityIdArray)], [arrayMap(x -> x.2, cityMetricCityIdArray)]) AS cityMetricSumArray,
    /* Take 5-top cities Ids. */
    arrayReverseSort((cityId, sumMetric) -> sumMetric, cityMetricSumArray.1, cityMetricSumArray.2) AS cityIds,
    arraySlice(cityIds, 1, 5) AS topNCityIds,
    /* Map cityIds to city names. */
    arrayMap(cityId -> arrayFirst(x -> x.3 = cityId, cityMetricCityIdArray).1, topNCityIds) AS topCities
FROM
(   /* test data */
    SELECT
        data.1 AS city,data.2 AS metric,'page' AS page
    FROM
    (
        SELECT arrayJoin([
          ('city1',11),('city2',('city3',('city4',22),('city5',5),('city6',('city7',10)]) AS data
    )
)
GROUP BY page
FORMAT Vertical

/* Result:
page:                  page
cityMetricArray:       [('city1',10)]
cityMetricCityIdArray: [('city1',11,1),2),3),4),22,5,6),10,7)]
cityMetricSumArray:    ([1,2,3,4,6,7],[11,33,10])
cityIds:               [4,7,5]
topNCityIds:           [4,3]
topCities:             ['city4','city2','city6','city1','city3']
*/
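
The four steps of the query (assign IDs, sum per ID, sort and slice, map back to names) can be sketched in plain Python as follows. This only illustrates the logic under stated assumptions and is not ClickHouse code: the function name `top_cities` and the sample rows are hypothetical, and ties are kept in first-appearance order here only because Python's sort is stable.

```python
def top_cities(rows, n=5):
    """rows: list of (city, metric) pairs for one page."""
    # Step 1: give each distinct city a dense numeric id, in order of
    # first appearance (the role arrayEnumerateDense plays in the query).
    city_ids = {}
    for city, _ in rows:
        city_ids.setdefault(city, len(city_ids) + 1)

    # Step 2: sum the metric per city id (the role of sumMap).
    sums = {}
    for city, metric in rows:
        cid = city_ids[city]
        sums[cid] = sums.get(cid, 0) + metric

    # Step 3: sort ids by summed metric, descending, and keep the first n.
    top_ids = sorted(sums, key=sums.get, reverse=True)[:n]

    # Step 4: map the ids back to city names.
    names = {cid: city for city, cid in city_ids.items()}
    return [names[cid] for cid in top_ids]


# Hypothetical sample rows, made up for this sketch.
sample = [("x", 3), ("y", 10), ("x", 9), ("z", 1)]
print(top_cities(sample, n=2))  # ['x', 'y']
```

City x has two rows (3 and 9), so its summed metric of 12 puts it ahead of y even though y has the largest single row.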
