Regex to append some characters at certain positions

I have a txt file that looks like this:

abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ; 

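Because the file is very long, I want to modify it with a regular expression. I want to append the first word of each line after the first ";" and before each CAT(...). There should also be a second ";" after the appended word and before CAT. How can I do this?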

So my output would be:

abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;
morenming answered:

You can try the following find-and-replace in regex mode:

Find:    ^([^(]+)(.*?;)(CAT.*)$
Replace: $1$2$1;$3

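The idea here is to break each line into the pieces needed to assemble the replacement. The first capture group is the word to insert after the first semicolon and before CAT, the second is everything up to and including that semicolon, and the third is CAT through the end of the line.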
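To make the three groups concrete, here is a minimal Python sketch that applies the same Find pattern to the sample line from the question and prints each captured piece. The variable names are illustrative only, and the `$1`-style replacement is rewritten as `\1` because that is the backreference syntax Python's `re` module uses.

```python
import re

# Same pattern as the Find field above; the sample line comes from the question.
pattern = re.compile(r'^([^(]+)(.*?;)(CAT.*)$')
line = "abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;"

match = pattern.match(line)
word, head, tail = match.groups()
print(word)  # group 1: the first word
print(head)  # group 2: everything up to and including the ";" before CAT
print(tail)  # group 3: CAT through the end of the line

# Equivalent of "Replace: $1$2$1;$3", using Python's \1-style backreferences.
rewritten = pattern.sub(r'\1\2\1;\3', line)
print(rewritten)
```

For text containing many lines, the `^` and `$` anchors would need the `re.MULTILINE` flag to match at each line boundary.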


I just noticed that you are using Python. We can try:

import re

inp = """aarhus(iof>city>thing,equ>arhus);CAT(CATN),N(NP) ;
abadan(iof>city>thing);CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;"""
# \s* consumes the line break after each record, so the replacement adds "\n" back.
output = re.sub(r'([^(]+)(.*?;)(CAT.*?;)\s*','\\1\\2\\1;\\3\n',inp)
print(output)

This prints:

aarhus(iof>city>thing,equ>arhus);aarhus;CAT(CATN),N(NP) ;
abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;
,

In Python, you can do the following:

import re

test_strings = [
    'aarhus(iof>city>thing,equ>arhus);CAT(CATN),N(NP) ;',
    'abadan(iof>city>thing);CAT(CATN),N(NP) ;',
    'abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;',
]
# The first group matches the word that you want to repeat, the second captures
# the rest up to ;CAT, and the third captures ;CAT and everything after it.
regex = r'(\w+)(.*)(;CAT.*)'

new_strings = []
for test_string in test_strings:
    match = re.match(regex,test_string)
    new_string = match.group(1) + match.group(2) + ";" + match.group(1) + match.group(3)
    new_strings.append(new_string)
    print(new_string)

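This gives you:

aarhus(iof>city>thing,equ>arhus);aarhus;CAT(CATN),N(NP) ;
abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;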

Your strings are stored in the new_strings list.
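One caveat: the loop above calls match.group(...) directly, so it raises AttributeError on any line where re.match returns None, such as a blank line or one without ;CAT. Here is a minimal sketch of a guarded version. The function name append_first_word and the two non-matching sample lines are hypothetical, added only for illustration.

```python
import re

# Same regex as in the answer above, compiled once for reuse.
regex = re.compile(r'(\w+)(.*)(;CAT.*)')

def append_first_word(line):
    """Repeat the first word before ;CAT, or return the line unchanged if it does not match."""
    match = regex.match(line)
    if match is None:
        return line  # leave non-matching lines as they are
    word, middle, rest = match.groups()
    return f"{word}{middle};{word}{rest}"

samples = [
    'abadan(iof>city>thing);CAT(CATN),N(NP) ;',
    '',                  # hypothetical blank line
    '# a comment line',  # hypothetical line without ;CAT
]
results = [append_first_word(s) for s in samples]
for result in results:
    print(result)
```

Returning the line unchanged is only one possible policy. Depending on the file, skipping or logging such lines may be more appropriate.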

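Edit: to read the file into a list of strings ready for modification, use a with open statement and call readlines(). Each element keeps its trailing newline.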

my_file = 'my_text_file.txt'

with open(my_file,'r') as f:
    my_file_as_list = f.readlines()
,

Matching the groups separately and stitching them together may be faster than a regex replace, though this has not been tested.

import re

#=== DESIRED ===================================================================
# aarhus(iof>city>thing,equ>arhus);aarhus;CAT(CATN),N(NP) ;
# abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
# abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;
#===============================================================================

data = ["abadan(iof>city>thing);CAT(CATN),N(NP) ;",
        "abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;"]

# Matching different groups and then stitching them together may be faster than a regex replace.
# Based on https://stackoverflow.com/questions/3850074/regex-until-but-not-including
# (?:(?!CAT).)* - match anything until the start of the word CAT.
# I.e.
# (?:        # Match the following but do not capture it:
#  (?!CAT)   # first assert that it's not possible to match "CAT" here,
#  .         # then match any character
# )*         # end of group, zero or more repetitions.
# Raw strings keep the backslash in \( from being read as a string escape.
p = ''.join([r"^",                    # Match start of string
             r"(.*?(?:(?!\().)*)",    # Group 1: anything up to the first open paren, i.e. the first word (abadan or abandon)
             r"(.*?(?:(?!CAT).)*)",   # Group 2: everything after group 1, up to but not including "CAT"
             r"(.*$)",                # Group 3: the rest of the line
             ])

for line in data:
    m = re.match(p,line)    
    newline  = m.group(1) # First word
    newline += m.group(2) # Group two
    newline += m.group(1) + ";" # First word again with semi-colon
    newline += m.group(3) # Group three

    print(newline)

Output:

abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;
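The speed claim above is left untested, so here is a rough sketch of how it could be checked with the standard timeit module. It compares this answer's match-and-stitch pattern against the substitution pattern from the first answer on the two sample lines. The function names and the repetition count are arbitrary choices for illustration, and timings will vary by machine and input.

```python
import re
import timeit

lines = [
    "abadan(iof>city>thing);CAT(CATN),N(NP) ;",
    "abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;",
]

# Pattern from this answer (match, then stitch the groups together).
match_pattern = re.compile(r"^(.*?(?:(?!\().)*)(.*?(?:(?!CAT).)*)(.*$)")
# Pattern from the first answer (single substitution).
sub_pattern = re.compile(r"^([^(]+)(.*?;)(CAT.*)$")

def by_matching(line):
    m = match_pattern.match(line)
    return m.group(1) + m.group(2) + m.group(1) + ";" + m.group(3)

def by_substitution(line):
    return sub_pattern.sub(r"\1\2\1;\3", line)

# Sanity check: both approaches should produce identical lines.
same = all(by_matching(l) == by_substitution(l) for l in lines)
print(same)

for func in (by_matching, by_substitution):
    seconds = timeit.timeit(lambda: [func(l) for l in lines], number=10_000)
    print(f"{func.__name__}: {seconds:.4f} s")
```

A meaningful comparison would run this against a representative sample of the real file rather than two short lines.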
,

This script reads the input file, performs the substitution, and writes the result to the output file:

import re

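infile = 'input.txt'
outfile = 'outfile.txt'
# Open both files in one with block so they are flushed and closed on exit.
with open(infile,'r') as f, open(outfile,'w') as o:
    for line in f:
        # Group 1 is everything before ;CAT and group 2 is the first word;
        # the lookahead leaves ;CAT in place, so ";word" is inserted just before it.
        o.write(re.sub(r'((\w+).+?)(?=;CAT)',r'\1;\2',line))

cat outfile.txt
aarhus(iof>city>thing,equ>arhus);aarhus;CAT(CATN),N(NP) ;
abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;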