我试图只从NCBI xml BLAST文件中提取第一个命中.接下来我想获得第一个HSP.在最后阶段,我想根据最高分获得这些.
在这里清楚地说明xml文件的一个示例:
在这里清楚地说明xml文件的一个示例:
- <?xml version="1.0"?>
- <!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
- <BlastOutput>
- <BlastOutput_program>blastx</BlastOutput_program>
- <BlastOutput_version>blastx 2.2.22 [Sep-27-2009]</BlastOutput_version>
- <BlastOutput_reference>~Reference: Altschul,Stephen F.,Thomas L. Madden,Alejandro A. Schaffer,~Jinghui Zhang,Zheng Zhang,Webb Miller,and David J. Lipman (1997),~"Gapped BLAST and PSI-BLAST: a new generation of protein database search~programs",Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
- <BlastOutput_db>/Applications/blast/db/viral1.protein.faa</BlastOutput_db>
- <BlastOutput_query-ID>lcl|1_0</BlastOutput_query-ID>
- <BlastOutput_query-def>DSAD-090629_plate11A01a.g1 CHROMAT_FILE: DSAD-090629_plate11A01a.g1 PHD_FILE: DSAD-090629_plate11A01a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A01a DIRECTION: rev</BlastOutput_query-def>
- <BlastOutput_query-len>1024</BlastOutput_query-len>
- <BlastOutput_param>
- <Parameters>
- <Parameters_matrix>BLOSUM62</Parameters_matrix>
- <Parameters_expect>1e-05</Parameters_expect>
- <Parameters_gap-open>11</Parameters_gap-open>
- <Parameters_gap-extend>1</Parameters_gap-extend>
- <Parameters_filter>F</Parameters_filter>
- </Parameters>
- </BlastOutput_param>
- <BlastOutput_iterations>
- <Iteration>
- <Iteration_iter-num>1</Iteration_iter-num>
- <Iteration_query-ID>lcl|1_0</Iteration_query-ID>
- <Iteration_query-def>DSAD-090629_plate11A01a.g1 CHROMAT_FILE: DSAD-090629_plate11A01a.g1 PHD_FILE: DSAD-090629_plate11A01a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A01a DIRECTION: rev</Iteration_query-def>
- <Iteration_query-len>1024</Iteration_query-len>
- <Iteration_stat>
- <Statistics>
- <Statistics_db-num>68007</Statistics_db-num>
- <Statistics_db-len>19518578</Statistics_db-len>
- <Statistics_hsp-len>0</Statistics_hsp-len>
- <Statistics_eff-space>0</Statistics_eff-space>
- <Statistics_kappa>0.041</Statistics_kappa>
- <Statistics_lambda>0.267</Statistics_lambda>
- <Statistics_entropy>0.14</Statistics_entropy>
- </Statistics>
- </Iteration_stat>
- <Iteration_message>No hits found</Iteration_message>
- </Iteration>
- <Iteration>
- <Iteration>
- <Iteration_iter-num>6</Iteration_iter-num>
- <Iteration_query-ID>lcl|6_0</Iteration_query-ID>
- <Iteration_query-def>DSAD-090629_plate11A05a.g1 CHROMAT_FILE: DSAD-090629_plate11A05a.g1 PHD_FILE: DSAD-090629_plate11A05a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A05a DIRECTION: rev</Iteration_query-def>
- <Iteration_query-len>1068</Iteration_query-len>
- <Iteration_hits>
- <Hit>
- <Hit_num>1</Hit_num>
- <Hit_id>gnl|BL_ORD_ID|23609</Hit_id>
- <Hit_def>gi|38707884|ref|NP_945016.1| Putative ribose-phosphate pyrophosphokinase [Enterobacteria phage Felix 01]</Hit_def>
- <Hit_accession>23609</Hit_accession>
- <Hit_len>293</Hit_len>
- <Hit_hsps>
- <Hsp>
- <Hsp_num>1</Hsp_num>
- <Hsp_bit-score>49.2914</Hsp_bit-score>
- <Hsp_score>116</Hsp_score>
- <Hsp_evalue>5.15408e-06</Hsp_evalue>
- <Hsp_query-from>580</Hsp_query-from>
- <Hsp_query-to>792</Hsp_query-to>
- <Hsp_hit-from>202</Hsp_hit-from>
- <Hsp_hit-to>273</Hsp_hit-to>
- <Hsp_query-frame>-1</Hsp_query-frame>
- <Hsp_identity>26</Hsp_identity>
- <Hsp_positive>45</Hsp_positive>
- <Hsp_gaps>2</Hsp_gaps>
- <Hsp_align-len>73</Hsp_align-len>
- <Hsp_qseq>MHIIGDVE--GRTCILVDDMVDTAGTLCHAAKALKERGAAKVYAYCTHPVLSGRAIENIENSVLDELVVTNTI</Hsp_qseq>
- <Hsp_hseq>MRILDDVDLTDKTVMILDDICDGGRTFVEAAKHLREAGAKRVELYVTHGIFS-KDVENLLDNGIDHIYTTNSL</Hsp_hseq>
- <Hsp_midline>M I+ DV+ +T +++DD+ D T AAK L+E GA +V Y TH + S + +EN+ ++ +D + TN++</Hsp_midline>
- </Hsp>
- </Hit_hsps>
- </Hit>
- <Hit>
- <Hit_num>2</Hit_num>
- <Hit_id>gnl|BL_ORD_ID|2466</Hit_id>
- <Hit_def>gi|51557505|ref|YP_068339.1| large tegument protein [Suid herpesvirus 1]</Hit_def>
- <Hit_accession>2466</Hit_accession>
- <Hit_len>3084</Hit_len>
- <Hit_hsps>
- <Hsp>
- <Hsp_num>1</Hsp_num>
- <Hsp_bit-score>48.9062</Hsp_bit-score>
- <Hsp_score>115</Hsp_score>
- <Hsp_evalue>6.70494e-06</Hsp_evalue>
- <Hsp_query-from>369</Hsp_query-from>
- <Hsp_query-to>875</Hsp_query-to>
- <Hsp_hit-from>2312</Hsp_hit-from>
- <Hsp_hit-to>2465</Hsp_hit-to>
- <Hsp_query-frame>-2</Hsp_query-frame>
- <Hsp_identity>52</Hsp_identity>
- <Hsp_positive>70</Hsp_positive>
- <Hsp_gaps>4</Hsp_gaps>
- <Hsp_align-len>173</Hsp_align-len>
- <Hsp_qseq>APESQEPGASTWRSSTSVVKKGQPSQK*CTSSVTSKAVPASWSTTWSTLPAPCATPPKR*KSAAPPRSTPTAPTRCCPAAPSRTSRIPSWTSWWSPTPSRCPLRRSPARVFASSTSPR-SSPKRSAASATKNRSAP---CSAKRNWPDHTAPPRAGLFALPPEAGRKPQGGLV</Hsp_qseq>
- <Hsp_hseq>APPAQKPPAQPATAAATTAPKATPQTQPPTRAQTQTAPPPPSAAT-----AAAQVPPQ------PPSSQPAAKPRGAPPAPPAPP--PPSAQTTLPRPAAPPAPPPPS---AQTTLPRPAPPPPSAPAATPTPPAPGPAPSAKKSDGDRIVEPKAG---APPDVRDAKFGGKV</Hsp_hseq>
- <Hsp_midline>AP +Q+P A ++ + K P + T + T A P + T A PP+ PP S P A R P AP P P P+ P P+ A +T PR + P SA +AT AP SAK++ D P+AG PP+ GG V</Hsp_midline>
- </Hsp>
- </Hit_hsps>
- </Hit>
- </Iteration_hits>
- <Iteration_stat>
- <Statistics>
- <Statistics_db-num>68007</Statistics_db-num>
- <Statistics_db-len>19518578</Statistics_db-len>
- <Statistics_hsp-len>0</Statistics_hsp-len>
- <Statistics_eff-space>0</Statistics_eff-space>
- <Statistics_kappa>0.041</Statistics_kappa>
- <Statistics_lambda>0.267</Statistics_lambda>
- <Statistics_entropy>0.14</Statistics_entropy>
- </Statistics>
- </Iteration_stat>
- </Iteration>
基本上每个查询搜索都会创建一个迭代元素.每次迭代都可以有多次命中,而后者又可以有多个HSP.我想只获得第一个命中,它是每次迭代的第一个HSP.如果BLAST没有发现命中,我想忽略迭代.
我编写了这个简单的代码:
- #!/usr/bin/env python
- from elementtree.ElementTree import parse
- from elementtree import ElementTree as ET
- file = open("/Applications/blast/blanes_viral_nr_results.xml","r")
- save_file = open("/Applications/blast/Blast_parse_ET.txt",'w')
- tree = parse(file)
- elem = tree.getroot()
- print elem
- Per_ID = ()
- save_file.write('>%s\t%s\t%s\t%s\t%s\t%s\t\n\n\n\n' % ("It_Num\t","It_ID\t","Hit_Def\t","Num\t","ID\t","ACC\t"))
- iteration = tree.findall('BlastOutput_iterations/Iteration')
- for iteration in iteration:
- for hit in iteration.findall('Iteration_hits/Hit'):
- It_Num = iteration.findtext('Iteration_iter-num')
- It_ID = iteration.findtext('Iteration_query-def')
- Hit_Def = hit.findtext('Hit_def')
- Num = hit.findtext('Hit_num')
- ID = hit.findtext('Hit_id')
- DEF = hit.findtext('Hit_def')
- ACC = hit.findtext('Hit_accession')
- save_file.write('>%s\t%s\t%s\t%s\t%s\t%s\t' % (It_Num,It_ID[12:26],Hit_Def[1:10],Num,ID,ACC,))
- for hsp in hit.findall('Hit_hsps'):
- HSPN = hsp.findtext('Hsp/Hsp_num')
- identities = hsp.findtext('Hsp/Hsp_identity')
- #print 'id: ',identities.rjust(4),length = hsp.findtext('Hsp/Hsp_align-len')
- #print 'len:',length.rjust(4),Per_ID = int(identities) * 100.0 / int(length)
- #print hsp.findtext('Hsp/Hsp_qseq')[:50]
- #print hsp.findtext('Hsp/Hsp_midline')[:50]
- #print hsp.findtext('Hsp/Hsp_hseq')[:50]
- save_file.write('%s\t%s\t%s\%st\n' % ('***','%',HSPN,Per_ID))
- save_file.write('n\n' % ())
任何帮助都会非常受欢迎!
虽然构建自己的解析器可能很“有趣”,但已经有一个可以解析BLAST xml文件的包……如果你愿意,它甚至可以为你进行本地BLAST实例的中间调用.
主要网站在这里:
http://biopython.org/wiki/Biopython
XML BLAST解析器在这里:
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc82
就像是:
- from Bio.Blast import NCBIXML
- with open('xml/results/file') as handle:
- all_records = NCBIXML.parse(handle)
- first_record = all_records.next()
应该管用.我一般都喜欢BioPython解析器和编写器,但我不喜欢类结构组织.所以我通常只使用解析器并将我需要的信息提取到我自己的结构中.
因人而异
希望有所帮助.