Extracting keywords from multiple .gz files in Python

2024-05-03 • Q&A

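Question: How can I search for keywords in Python across multiple files, including both gzip-compressed (.gz) files and uncompressed ones? I have a folder containing several archived logs. The newest file is "messages", and older logs are automatically compressed into .gz files.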

-rw------- 1 root root 21262610 Nov 4 11:20 messages
-rw------- 1 root root 3047453 Nov 2 15:49 messages-20191102-1572680982.gz
-rw------- 1 root root 3018032 Nov 3 04:43 messages-20191103-1572727394.gz
-rw------- 1 root root 3026617 Nov 3 17:32 messages-20191103-1572773536.gz
-rw------- 1 root root 3044692 Nov 4 06:17 messages-20191104-1572819469.gz

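I wrote a function that:

Stores all the file names in a list. (This works.)
Opens each file in the list, using gzip.open() if it is a .gz file.
Searches for the keywords.

But I don't think this approach is very smart, because the message logs are actually very large and are split across multiple .gz files. I also have many keywords stored in a keyword file.

So, is there a better solution that concatenates all the files into a single I/O stream and then extracts the keywords from that stream?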

import gzip
import os

def open_all_message_files(path):

    # Collect every file whose name starts with "messages"
    files_list = []
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.startswith("messages"):
                files_list.append(os.path.join(root, file))

    for x in files_list:
        if x.endswith('gz'):
            # gzip.open in "r" mode yields bytes, hence the bytes literals
            with gzip.open(x, "r") as f:
                for line in f:
                    if b'keywords_1' in line:
                        print(line)
                    if b'keywords_2' in line:
                        print(line)
        else:
            # Plain files opened in "r" mode yield str
            with open(x, "r") as f:
                for line in f:
                    if 'keywords_1' in line:
                        print(line)
                    if 'keywords_2' in line:
                        print(line)

import os
import gzip
import re
import fnmatch


def find_files(pattern, path):
    """
    Here you can find all the filenames that match a specific pattern
    using shell wildcard pattern that way you avoid hardcoding
    the file pattern i.e 'messages'
    """
    for root, dirs, files in os.walk(path):
        for name in fnmatch.filter(files, pattern):
            yield os.path.join(root, name)


def file_opener(filenames):
    """
    Open a sequence of filenames one at a time
    and make sure to close the file once we are done
    scanning its content.
    """
    for filename in filenames:
        if filename.endswith('.gz'):
            f = gzip.open(filename, 'rt')
        else:
            f = open(filename, 'rt')
        yield f
        f.close()


def chain_generators(iterators):
    """
    Chain a sequence of iterators together
    """
    for it in iterators:
        # Look up yield from if you're unsure what it does
        yield from it


def grep(pattern, lines):
    """
    Look for a pattern in a line
    """
    pat = re.compile(pattern)
    for line in lines:
        if pat.search(line):
            yield line


# A simple way to use these functions together
logs = find_files('messages*', 'One/two/three')
files = file_opener(logs)
lines = chain_generators(files)
each_line = grep('keywords_1', lines)
for match in each_line:
    print(match)
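
The question also mentions many keywords stored in a keyword file, while the grep step above looks for a single pattern. One possible way to handle this is to combine all keywords into one compiled regular expression, so each line is scanned only once. This is a minimal sketch, not part of the original answer: the helper names load_keywords and grep_any are hypothetical, and it assumes the keyword file holds one literal keyword per line.

```python
import re


def load_keywords(path):
    """Read one keyword per line, skipping blank lines (assumed file layout)."""
    with open(path, "rt") as f:
        return [line.strip() for line in f if line.strip()]


def grep_any(keywords, lines):
    """Yield every line that contains at least one of the keywords."""
    if not keywords:
        # An empty alternation would match every line, so yield nothing instead
        return
    # re.escape makes each keyword match literally; '|' joins the alternatives
    pat = re.compile("|".join(re.escape(k) for k in keywords))
    for line in lines:
        if pat.search(line):
            yield line
```

Under these assumptions, grep_any(load_keywords(keyword_file), lines) could take the place of grep('keywords_1', lines) in the pipeline, where keyword_file is the path to your keyword file and lines is the output of chain_generators.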


Answered by dahom: Extracting keywords from multiple .gz files in Python
