Phrase frequency counter with PostgreSQL 9.6 full-text search

2024-05-10 • Q&A

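I need to count how many times a phrase appears, using a tsquery against an indexed text field (tsvector data type). It works, but it is very slow because the table is huge. For single words I have precomputed all the frequencies, but I have no idea how to speed up a phrase search.

Edit: Thank you for your reply, @jjanes.

Here is my query: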

SELECT substring(date_input::text, 5) AS myear,
       ts_headline('simple', text_input, q,
                   'StartSel=<b>,StopSel=</b>,MaxWords=2,MinWords=1,ShortWord=1,HighlightAll=FALSE,MaxFragments=9999,FragmentDelimiter=" ... "') AS headline
FROM db_test,
     to_tsquery('simple', 'united<->kingdom') AS q
WHERE date_input BETWEEN '2000-01-01'::DATE AND '2019-12-31'::DATE
  AND idxfti_simple @@ q

Here is the output of EXPLAIN (ANALYZE, BUFFERS):

Nested Loop  (cost=25408.33..47901.67 rows=5509 width=64) (actual time=286.536..17133.343 rows=38127 loops=1)
Buffers: shared hit=224723
    ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual time=0.005..0.007 rows=1 loops=1)
    ->  Append  (cost=25408.33..46428.00 rows=5510 width=625) (actual time=285.372..864.868 rows=38127 loops=1)
        Buffers: shared hit=165713
        ->  Bitmap Heap Scan on db_test  (cost=25408.33..46309.01 rows=5509 width=625) (actual time=285.368..791.111 rows=38127 loops=1)
            Recheck Cond: ((idxfti_simple @@ q.q) AND (date_input >= '2000-01-01'::date) AND (date_input <= '2019-12-31'::date))
            Rows Removed by Index Recheck: 136
            Heap Blocks: exact=29643
            Buffers: shared hit=165607
                ->  BitmapAnd  (cost=25408.33..25408.33 rows=5509 width=0) (actual time=278.370..278.371 rows=0 loops=1)
                Buffers: shared hit=3838
                    ->  Bitmap Index Scan on idxftisimple_idx  (cost=0.00..1989.01 rows=35869 width=0) (actual time=67.280..67.281 rows=176654 loops=1)
                        Index Cond: (idxfti_simple @@ q.q)
                        Buffers: shared hit=611
                    ->  Bitmap Index Scan on db_test_date_input_idx  (cost=0.00..23142.24 rows=1101781 width=0) (actual time=174.711..174.712 rows=1149456 loops=1)
                        Index Cond: ((date_input >= '2000-01-01'::date) AND (date_input <= '2019-12-31'::date))
                        Buffers: shared hit=3227
        ->  Seq Scan on test  (cost=0.00..118.98 rows=1 width=451) (actual time=0.280..0.280 rows=0 loops=1)
            Filter: ((date_input >= '2000-01-01'::date) AND (date_input <= '2019-12-31'::date) AND (idxfti_simple @@ q.q))
            Rows Removed by Filter: 742
            Buffers: shared hit=106

Planning time: 0.332 ms
Execution time: 17176.805 ms

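Sorry, I am not able to turn track_io_timing on. I know you do not recommend using ts_headline, but I need it to count how many times the phrase appears within the same field.

Thanks in advance for your help.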

Version 9.2 is old and no longer supported. For one thing, it has no native support for phrase search (that was introduced in 9.6).
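For illustration, here is a minimal sketch of what native phrase search looks like from 9.6 onward. The sample strings are made up for the example; only the `<->` operator and the 'simple' configuration come from the question.

```sql
-- <-> (FOLLOWED BY) matches only when the second lexeme comes
-- immediately after the first. Requires PostgreSQL 9.6 or later.
SELECT to_tsvector('simple', 'the united kingdom of great britain')
       @@ to_tsquery('simple', 'united <-> kingdom') AS adjacent_match,  -- true
       to_tsvector('simple', 'kingdom united')
       @@ to_tsquery('simple', 'united <-> kingdom') AS reversed_match;  -- false
```

phraseto_tsquery('simple', 'united kingdom') builds the same kind of query from plain text, if you would rather not write the operator by hand.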

Please upgrade.

If it is still slow after that, please show us the query and its EXPLAIN (ANALYZE, BUFFERS) output, preferably with track_io_timing turned on.
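In case it helps, this is roughly how that output is collected in a single session. The query here is a simplified stand-in for the real one, and changing track_io_timing needs superuser rights, so it may not be possible on a managed or shared server.

```sql
-- Needs superuser privileges; without them, skip this line and
-- EXPLAIN will still report buffer counts, just no I/O timings.
SET track_io_timing = on;

-- Simplified stand-in for the real query (table and column names
-- are taken from the question).
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM db_test
WHERE idxfti_simple @@ to_tsquery('simple', 'united <-> kingdom');
```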

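Note that fetching the rows in the Bitmap Heap Scan is fast, less than 0.8 seconds, and that almost all of the time is spent in the top-level node. That time is most likely spent in ts_headline, re-parsing the text_input documents. As long as you keep using ts_headline, there is not much you can do about that.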

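ts_headline does not directly give you what you want (the frequency), so you have to post-process its output in some way. Maybe you could post-process the tsvector directly instead, so that the documents do not need to be re-parsed.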
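A rough sketch of that idea, not a tested solution: unnest(tsvector), available from 9.6, exposes each lexeme with its positions, so adjacent occurrences can be counted by comparing positions. This assumes idxfti_simple is a tsvector built from text_input with the 'simple' configuration and that positions were not stripped. The alias names (u, k, up, kp, phrase_count) are only illustrative.

```sql
-- Count occurrences of "united kingdom" per row from the stored
-- lexeme positions, without calling ts_headline or re-parsing text_input.
SELECT d.date_input,
       (SELECT count(*)
        FROM unnest(d.idxfti_simple) AS u(lexeme, positions, weights),
             unnest(u.positions)     AS up(pos),
             unnest(d.idxfti_simple) AS k(lexeme, positions, weights),
             unnest(k.positions)     AS kp(pos)
        WHERE u.lexeme = 'united'
          AND k.lexeme = 'kingdom'
          AND kp.pos = up.pos + 1    -- "kingdom" directly follows "united"
       ) AS phrase_count
FROM db_test AS d
WHERE d.date_input BETWEEN DATE '2000-01-01' AND DATE '2019-12-31'
  AND d.idxfti_simple @@ to_tsquery('simple', 'united <-> kingdom');
```

One caveat: a tsvector keeps at most 256 positions per lexeme and caps position values at 16383, so counts for very long documents can come out too low. Wrapping the query in an outer sum(phrase_count) would give a total over the date range.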

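Another option is to upgrade further, which could allow the work of ts_headline to be spread over multiple CPUs. PostgreSQL 9.6 was the first version to support parallel query, and in that version it was not mature enough to parallelize this type of thing. v10 might be enough to get a parallel query for this, but you might as well jump all the way to v12.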
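To check whether a newer version actually parallelizes the query, something along these lines could be tried. The worker count of 4 is an arbitrary example value, and whether the planner picks a parallel plan depends on the version, the settings and its cost estimates.

```sql
-- Session-level limit on workers per Gather node. The default is 0
-- in 9.6 (parallel query off) and 2 from v10 onward.
SET max_parallel_workers_per_gather = 4;

EXPLAIN (ANALYZE, BUFFERS)
SELECT ts_headline('simple', text_input, q, 'MaxWords=2,MinWords=1') AS headline
FROM db_test,
     to_tsquery('simple', 'united <-> kingdom') AS q
WHERE idxfti_simple @@ q;
-- A parallel plan shows a Gather node with "Workers Planned" and
-- "Workers Launched" lines; if none appears, the query ran serially.
```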
