基于FASTA字符串中数值比较的数据分离

我正在尝试使用read.fasta将我从FASTA文件导入的数据分离到所有数据的列表中,底部的粗体数字至少比其右侧的数字大5倍

“> NP_000005.2 \ t1474 \ tGI:66932947 \ tnm_000014.4:114..4538 \ tGeneID:2 \ tA2M \ talpha-2-巨球蛋白前体[智人]。\ tSAP2:rs200486795; S1411R; MAF = 0.0119 ,0.025,0.0161“ attr(,“ class”)

任何建议将不胜感激。

nancy69669 回答:基于FASTA字符串中数值比较的数据分离

我对seqinr不太熟悉,并且在读取fasta标头时有些奇怪。所以我在下面用Biostrings包展示一个例子:

library(Biostrings)
fa = readAAStringSet("muscle.var.fasta")
head(names(fa),1)
[1] "NP_000005.2\t1474\tGI:66932947\tNM_000014.4:114..4538\tGeneID:2\tA2M\talpha-2-macroglobulin precursor [Homo sapiens].\tSAP1:rs376343602;T1455M;MAF=0.0243,0.0,0.0168" 

这是示例数据集:

Data <- c("NP_000005.2\t1474\tGI:66932947\tNM_000014.4:114..4538\tGeneID:2\tA2M\talpha-2-macroglobulin precursor [Homo sapiens].\tSAP1:rs376343602;T1455M;MAF=0.0243,0.0168","NP_000005.2\t1474\tGI:66932947\tNM_000014.4:114..4538\tGeneID:2\tA2M\talpha-2-macroglobulin precursor [Homo sapiens].\tSAP2:rs200486795;S1411R;MAF=0.0119,0.025,0.0161","NP_000005.2\t1474\tGI:66932947\tNM_000014.4:114..4538\tGeneID:2\tA2M\talpha-2-macroglobulin precursor [Homo sapiens].\tSAP3:rs181129451;R1407W;MAF=0.0,0.7992,0.2571","NP_000005.2\t1474\tGI:66932947\tNM_000014.4:114..4538\tGeneID:2\tA2M\talpha-2-macroglobulin precursor [Homo sapiens].\tSAP4:rs374966730;T1359I;MAF=0.0118,0.0080","NP_000005.2\t1474\tGI:66932947\tNM_000014.4:114..4538\tGeneID:2\tA2M\talpha-2-macroglobulin precursor [Homo sapiens].\tSAP5:rs372632979;V1345M;MAF=0.0118,0.0080"
)

我们首先将标题转换为表格形式,

test = read.delim(text=Data,header=F,stringsAsFactors=F)

,我们想要包含MAF的第8列,我们使用strsplit根据字符拆分该列; =,:

strsplit(test[1,8],"[;=,]")
[[1]]
[1] "SAP1:rs376343602" "T1455M"           "MAF"              "0.0243"          
[5] "0.0"              "0.0168" 

分割后,我们需要第4到第6个值,下面是一些基于此strsplit的代码

AF = t(sapply(strsplit(test[,]"),function(i)as.numeric(i[4:6])))
> head(AF)
       [,1]   [,2]   [,3]
[1,] 0.0243 0.0000 0.0168
[2,] 0.0119 0.0250 0.0161
[3,] 0.0000 0.7992 0.2571
[4,] 0.0118 0.0000 0.0080
[5,] 0.0118 0.0000 0.0080

它为您提供名称1到5的第一个,第二个和第三个值,在“ MAF =“之后”

在此示例中我们想要的是第1,4和5行,我们可以使用

wh = which(AF[,1]>5*AF[,2])
#gives the header which fulfills this

test[wh,]
Data[wh]

让我们将其应用于整个数据集

test = read.delim(text=names(fa),stringsAsFactors=F)
AF = t(sapply(strsplit(test[,function(i)as.numeric(i[4:6])))
wh = which(AF[,2])
# check hits
length(wh)
[1] 30209
# tail hits
tail(names(fa[wh]),2)
[1] "NP_998839.1\t284\tGI:47519616\tNM_213674.1:240..1094\tGeneID:7169\tTPM2\ttropomyosin beta chain isoform Tpm2.1sm/cy [Homo sapiens].\tSAP4:rs146271535;C36S;MAF=0.0116,0.0077"
[2] "NP_998890.1\t89\tGI:47524167\tNM_213725.1:130..399\tGeneID:6176\tRPLP1\t60S acidic ribosomal protein P1 isoform 2 [Homo sapiens].\tSAP2:rs149295760;N28S;MAF=0.0116,0.0077"

您可以写出fasta文件或名称:

writeLines(names(fa)[wh],"hits.names.txt")
writeXStringSet(fa[wh],"hits.fa")
本文链接:https://www.f2er.com/3161846.html

大家都在问