我有一个任意数量的人的身份表(即别名).每行都有一个以前的名称和一个新名称.在生产中,大约有1M行.例如:
id,old,new --- 1,'Albert','Bob' 2,'Bob','Charles' 3,'Mary','Nancy' 4,'Charles','Albert' 5,'Lydia','Nancy' 6,'Zoe','Zoe'
我想要的是生成用户列表并引用他们各自的身份.这类似于查找连接标识的每个图形中的所有节点,或查找生成林:
User 1: Albert,Bob,Charles (identities: 1,2,4) User 2: Mary,Nancy,Lydia (identities: 3,5) User 3: Zoe (identities: 6)
我一直在修补Postgresql的WITH RECURSIVE,但它为每个产生了集合和子集.例如:
1,4 <-- spanning tree: good 2 <-- subset: discard 3,5 <-- spanning tree: good 4 <-- subset: discard 5 <-- subset: discard 6 <-- spanning tree: good
sqlFiddle:http://sqlfiddle.com/#!15/9eaed/4我的最新尝试.这是代码:
WITH RECURSIVE search_graph AS ( SELECT id,id AS min_id,ARRAY[id] AS path,ARRAY[old,new] AS emails FROM identities UNION SELECT identities.id,LEAST(identities.id,sg.min_id),(sg.path || identities.id),(sg.emails || identities.old || identities.new) FROM search_graph sg JOIN identities ON (identities.old = ANY(sg.emails) OR identities.new = ANY(sg.emails)) WHERE identities.id <> ALL(sg.path) ) SELECT array_agg(DISTINCT(p)) from search_graph,unnest(path) p GROUP BY min_id;
结果如下:
1,4 2 3,5 4 5 6
解决方法
我刚才写了一个类似问题的答案:
How to find all connected subgraphs of an undirected graph.在那个问题中,我使用了sql Server.有关中间CTE的详细说明,请参阅该答案.我将该查询改编为Postgres.
使用Postgres数组功能可以更有效地编写它,而不是将路径连接到文本列.
WITH RECURSIVE CTE_Idents AS ( SELECT old AS Ident FROM identities UNION SELECT new AS Ident FROM identities ),CTE_Pairs AS ( SELECT old AS Ident1,new AS Ident2 FROM identities WHERE old <> new UNION SELECT new AS Ident1,old AS Ident2 FROM identities WHERE old <> new ),CTE_Recursive AS ( SELECT CTE_Idents.Ident AS AnchorIdent,Ident1,Ident2,',' || Ident1 || ',' || Ident2 || ',' AS IdentPath,1 AS Lvl FROM CTE_Pairs INNER JOIN CTE_Idents ON CTE_Idents.Ident = CTE_Pairs.Ident1 UNION ALL SELECT CTE_Recursive.AnchorIdent,CTE_Pairs.Ident1,CTE_Pairs.Ident2,CTE_Recursive.IdentPath || CTE_Pairs.Ident2 || ',CTE_Recursive.Lvl + 1 AS Lvl FROM CTE_Pairs INNER JOIN CTE_Recursive ON CTE_Recursive.Ident2 = CTE_Pairs.Ident1 WHERE CTE_Recursive.IdentPath NOT LIKE ('%,' || CTE_Pairs.Ident2 || ',%') ),CTE_RecursionResult AS ( SELECT AnchorIdent,Ident2 FROM CTE_Recursive ),CTE_CleanResult AS ( SELECT AnchorIdent,Ident1 AS Ident FROM CTE_RecursionResult UNION SELECT AnchorIdent,Ident2 AS Ident FROM CTE_RecursionResult ),CTE_Groups AS ( SELECT CTE_Idents.Ident,array_agg(COALESCE(CTE_CleanResult.Ident,CTE_Idents.Ident) ORDER BY COALESCE(CTE_CleanResult.Ident,CTE_Idents.Ident)) AS AllIdents FROM CTE_Idents LEFT JOIN CTE_CleanResult ON CTE_CleanResult.AnchorIdent = CTE_Idents.Ident GROUP BY CTE_Idents.Ident ) SELECT AllIdents FROM CTE_Groups GROUP BY AllIdents ;
我在样本数据中添加了一行(7,X,Y).
结果
| allidents | |--------------------| | Lydia,Mary,Nancy | | Albert,Charles | | X,Y | | Zoe |