无法从Fasta标头创建文本文件

2024-05-08 • 问答

我在一个目录中有多个fasta文件。我需要为每个单独的fasta文件创建具有特定格式的文本文件。生成的文本文件将用作下游另一个程序的输入文件。我是python的新手，我确信下面的脚本和它们一样笨拙。我确信在biopython中有更好的方法来完成此任务。

import os
import re

for FILE in os.listdir():
    if FILE.endswith(".fasta"):
        OUTPUT = open(FILE+".lft",'w')
        with open(FILE,'r') as FIN:
            for LINE in FIN:
                if LINE.startswith('>'):
                    HEADER = re.sub('>','',LINE)
                    HEADER2 = re.sub('\n',HEADER)
                    PART1_HEADER = HEADER2.split(":")
                    CONTIG = str(PART1_HEADER[0])
                    PART2_HEADER = PART1_HEADER[1]
                    SPLIT_PART2 = PART2_HEADER.split("-")
                    START = int(SPLIT_PART2[0])
                    END = int(SPLIT_PART2[1])
                    LENGTH = END-START
                    OUTPUT.write(str(START) + '\t' + str(HEADER2) + '\t' + str(LENGTH) + '\t' + str(CONTIG) + '\t' + str(END) + '\n')

以下是每个fasta文件中标头的示例：

>Spp-0:0-500
>Spp-1:0-3538
>Spp-2:0-1421
>Spp-3:0-500

位置：

“ Spp” =物种名称

“-0” =斋戒中的序列ID

“：0-500” =序列的开始和结束位置（并非全部从零开始）。

我正在寻找一个文本文件，如下所示：

0       aVan-0:0-500    500     aVan-0  500
0       aVan-1:0-3538   3538    aVan-1  3538
0       aVan-2:0-1421   1421    aVan-2  1421
0       aVan-3:0-500    500     aVan-3  500

1st column: start position
2nd column: original header
3rd column: end position
4th column: everything before the ":"
5th column: the length between start & stop positions

我的代码运行得很好，但是如果目录中有25个fasta序列，我的代码将只处理前24个。它还会在屏幕上输出一堆数字（通常是41或44 ...不确定为什么？），我也想摆脱它。

awk -F ':' '/^>/ {split($2,a,/\-/);printf("%s\t%s\t%s\t%s\t%s\n",a[1],substr($0,2),a[2],substr($1,int(a[2])-int(a[1]));}' in.fasta 0 Spp-0:0-500 500 Spp-0 500 0 Spp-1:0-3538 3538 Spp-1 3538 0 Spp-2:0-1421 1421 Spp-2 1421 0 Spp-3:0-500 500 Spp-3 500

#!/usr/bin/env python3 import os def main(): # Don't use capital letters for variable names. for filename in os.listdir(): if filename.endswith(".fasta"): # Use context manager for both input and output. with open(filename + ".lft",'w') as output,\ open(filename,'r') as fasta_in: for line in fasta_in: if line.startswith('>'): # No need to use re.sub to just skip the first # character (which you know is a '>'). # .strip() will remove "blanks" at ends. header = line[1:].strip() # You can "unpack" a list to directly store the # parts in variables [contig,coords] = header.split(":") [start,end] = coords.split("-") length = int(end) - int(start) output.write("\t".join([ start,header,str(length),contig,end])) output.write("\n") main()

无法从Fasta标头创建文本文件

since1129 回答：无法从Fasta标头创建文本文件

大家都在问