How to aggregate values over a gzipped CSV file that exceeds RAM?

For starters: I am not experienced in bioinformatics, and especially not in programming, but I have built a script that goes through a so-called VCF file (only the individuals are included, one column = one individual) and uses a search string to find out, for every variant (line), whether the individual is homozygous or heterozygous.

This script works, at least on small subsets, but I know it stores everything in memory. I would like to run it on very large zipped files (even whole genomes), but I don't see how to turn this script into one that does everything line by line (since I want to count over entire columns, I just don't see how to solve that).

So the output for each individual is five things: the total number of variants, the number of homozygotes, the number of heterozygotes, and the proportions of homozygotes and heterozygotes. See the code below:

#!/usr/bin/env python
import re
import gzip

subset_cols = 'subset_cols_chr18.vcf.gz'
#nuc_div = 'nuc_div_chr18.txt'

gz_infile = gzip.GzipFile(subset_cols, "r")
#gz_outfile = gzip.GzipFile(nuc_div, "w")

# make a dictionary of the header line for easy retrieval of elements later on
headers = gz_infile.readline().rstrip().split('\t')
print headers

column_dict = {}
for header in headers:
    column_dict[header] = []
for line in gz_infile:
    columns = line.rstrip().split('\t')
    for i in range(len(columns)):
        c_header = headers[i]
        column_dict[c_header].append(columns[i])
#print column_dict

for key in column_dict:
    number_homozygotes = 0
    number_heterozygotes = 0

    for values in column_dict[key]:
        SearchStr = '(\d)/(\d):\d+,\d+:\d+:\d+:\d+,\d+,\d+'
        #this search string contains the regexp (this regexp was tested)
        Result = re.search(SearchStr, values)
        if Result:
            #here, it will skip the missing genotypes ./.
            variant_one = int(Result.group(1))
            variant_two = int(Result.group(2))

            if variant_one == 0 and variant_two == 0:
                continue
            elif variant_one == variant_two:
                #count +1 in case variant one and two are equal (so 0/0, 1/1, etc.)
                number_homozygotes += 1
            elif variant_one != variant_two:
                #count +1 in case variant one is not equal to variant two (so 1/0, 0/1, etc.)
                number_heterozygotes += 1

    print "%s homozygotes %s" % (number_homozygotes, key)
    print "%s heterozygotes %s" % (number_heterozygotes, key)

    variants = number_homozygotes + number_heterozygotes
    print "%s variants" % variants

    prop_homozygotes = (1.0*number_homozygotes/variants)*100
    prop_heterozygotes = (1.0*number_heterozygotes/variants)*100

    print "%s %% homozygous %s" % (prop_homozygotes, key)
    print "%s %% heterozygous %s" % (prop_heterozygotes, key)

Any help would be greatly appreciated, so I can continue working on large datasets.
Thank you :)

By the way, the VCF file looks something like this:
INDIVIDUAL_1 INDIVIDUAL_2 INDIVIDUAL_3
0/0:9,0:9:24:0,24,221 1/0:5,4:9:25:25,26 1/1:0,13:13:33:347,33,0

This is the header line with the individual ID names (I have 33 individuals in total and the ID tags are more complicated; I simplified them here), followed by many information lines with this same specific pattern. I am only interested in the first part with the slash, hence the regular expression.
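
To illustrate what that regular expression captures, here is a tiny worked example on one of the genotype fields above (this snippet is only an illustration, not part of the original script):

import re

search_str = r'(\d)/(\d):\d+,\d+:\d+:\d+:\d+,\d+,\d+'
field = '1/1:0,13:13:33:347,33,0'

match = re.search(search_str, field)
if match:                              # missing genotypes such as ./. simply do not match
    allele_one = int(match.group(1))   # 1
    allele_two = int(match.group(2))   # 1
    print(allele_one == allele_two)    # True, so this call is homozygous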

Disclosure: I work full-time on the Hail project.

Hi there! Welcome to programming and bioinformatics!

amirouche correctly points out that you need some kind of "streaming" or "line-by-line" algorithm to handle data that is too big to fit in the RAM of your machine. Unfortunately, if you are limited to Python without libraries, you have to chunk the file manually and handle the parsing of the VCF yourself.
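
If you do want to stay in plain Python, a line-by-line version of your counting logic is possible: keep one pair of counters per column and update them as each line is read, instead of collecting whole columns into a dictionary of lists. Here is a rough Python 3 sketch of that idea (my own illustration, with made-up names like genotype_re and stats; it assumes the file is tab-separated exactly as in your example):

import gzip
import re

genotype_re = re.compile(r'(\d)/(\d):\d+,\d+:\d+:\d+:\d+,\d+,\d+')

with gzip.open('subset_cols_chr18.vcf.gz', 'rt') as infile:
    headers = infile.readline().rstrip().split('\t')
    # one [homozygotes, heterozygotes] counter pair per column; nothing else is kept in RAM
    stats = {header: [0, 0] for header in headers}

    for line in infile:
        for header, field in zip(headers, line.rstrip().split('\t')):
            match = genotype_re.search(field)
            if not match:                 # skips missing genotypes such as ./.
                continue
            one, two = int(match.group(1)), int(match.group(2))
            if one == 0 and two == 0:     # ignore 0/0, as in the original script
                continue
            if one == two:
                stats[header][0] += 1     # homozygote (1/1, 2/2, ...)
            else:
                stats[header][1] += 1     # heterozygote (0/1, 1/0, ...)

for header, (n_hom, n_het) in stats.items():
    variants = n_hom + n_het
    if variants == 0:
        continue
    print("%s: %d variants, %.2f%% homozygous, %.2f%% heterozygous"
          % (header, variants, 100.0 * n_hom / variants, 100.0 * n_het / variants))

Memory use stays bounded by the number of columns rather than the number of lines, which is what makes whole-genome files feasible without a framework.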

The Hail project is a free, open-source tool for scientists with genetic data that is too big to fit in RAM, all the way up to too big to fit on one machine (i.e. tens of terabytes of compressed VCF data). Hail can take advantage of all the cores on one machine or all the cores on a cloud of machines. Hail runs on Mac OS X and most flavors of GNU/Linux. Hail exposes a statistical-genetics domain-specific language that makes your question much shorter to express.

The simplest answer

The most faithful translation of your Python code into Hail looks like this:

/path/to/hail importvcf -f YOUR_FILE.vcf.gz \
    annotatesamples expr -c \
      'sa.nCalled = gs.filter(g => g.isCalled).count(),sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),sa.nHet = gs.filter(g => g.isHet).count()' \
    annotatesamples expr -c \
      'sa.pHom = sa.nHom / sa.nCalled,sa.pHet = sa.nHet / sa.nCalled' \
    exportsamples -c 'sample = s,sa.*' -o sampleInfo.tsv

I ran the above command on my dual-core laptop against a 2.0GB file:

# ls -alh profile225.vcf.bgz
-rw-r--r--  1 dking  1594166068   2.0G Aug 25 15:43 profile225.vcf.bgz
# ../hail/build/install/hail/bin/hail importvcf -f profile225.vcf.bgz \
    annotatesamples expr -c \
      'sa.nCalled = gs.filter(g => g.isCalled).count(),sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),sa.nHet = gs.filter(g => g.isHet).count()' \
    annotatesamples expr -c \
      'sa.pHom = sa.nHom / sa.nCalled,sa.pHet = sa.nHet / sa.nCalled' \
    exportsamples -c 'sample = s,sa.*' -o sampleInfo.tsv
hail: info: running: importvcf -f profile225.vcf.bgz
[Stage 0:=======================================================> (63 + 2) / 65]hail: info: Coerced sorted dataset
hail: info: running: annotatesamples expr -c 'sa.nCalled = gs.filter(g => g.isCalled).count(),sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),sa.nHet = gs.filter(g => g.isHet).count()'
[Stage 1:========================================================>(64 + 1) / 65]hail: info: running: annotatesamples expr -c 'sa.pHom = sa.nHom / sa.nCalled,sa.pHet = sa.nHet / sa.nCalled'
hail: info: running: exportsamples -c 'sample = s,sa.*' -o sampleInfo.tsv
hail: info: while importing:
    file:/Users/dking/projects/hail-data/profile225.vcf.bgz  import clean
hail: info: timing:
  importvcf: 34.211s
  annotatesamples expr: 6m52.4s
  annotatesamples expr: 21.399ms
  exportsamples: 121.786ms
  total: 7m26.8s
# head sampleInfo.tsv
sample   pHomRef      pHet         nHom    nHet   nCalled
HG00096  9.49219e-01  5.07815e-02  212325  11359  223684
HG00097  9.28419e-01  7.15807e-02  214035  16502  230537
HG00099  9.27182e-01  7.28184e-02  211619  16620  228239
HG00100  9.19605e-01  8.03948e-02  214554  18757  233311
HG00101  9.28714e-01  7.12865e-02  214283  16448  230731
HG00102  9.24274e-01  7.57260e-02  212095  17377  229472
HG00103  9.36543e-01  6.34566e-02  209944  14225  224169
HG00105  9.29944e-01  7.00564e-02  214153  16133  230286
HG00106  9.25831e-01  7.41687e-02  213805  17128  230933

Wow! Seven minutes for 2GB, that's slow! Unfortunately, that is because VCF is not a great format for data analysis!

Optimizing the storage format

Let's convert to Hail's optimized storage format, the VDS, and re-run the command:

# ../hail/build/install/hail/bin/hail importvcf -f profile225.vcf.bgz write -o profile225.vds
hail: info: running: importvcf -f profile225.vcf.bgz
[Stage 0:========================================================>(64 + 1) / 65]hail: info: Coerced sorted dataset
hail: info: running: write -o profile225.vds
[Stage 1:>                                                         (0 + 4) / 65]
[Stage 1:========================================================>(64 + 1) / 65]
# ../hail/build/install/hail/bin/hail read -i profile225.vds \
    annotatesamples expr -c \
      'sa.nCalled = gs.filter(g => g.isCalled).count(),sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),sa.nHet = gs.filter(g => g.isHet).count()' \
    annotatesamples expr -c \
      'sa.pHom = sa.nHom / sa.nCalled,sa.pHet = sa.nHet / sa.nCalled' \
    exportsamples -c 'sample = s,sa.*' -o sampleInfo.tsv
hail: info: running: read -i profile225.vds
[Stage 1:>                                                           (0 + 0) / 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[Stage 1:============================================>               (3 + 1) / 4]hail: info: running: annotatesamples expr -c 'sa.nCalled = gs.filter(g => g.isCalled).count(),sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),sa.nHet = gs.filter(g => g.isHet).count()'
[Stage 2:========================================================>(64 + 1) / 65]hail: info: running: annotatesamples expr -c 'sa.pHom = sa.nHom / sa.nCalled,sa.pHet = sa.nHet / sa.nCalled'
hail: info: running: exportsamples -c 'sample = s,sa.*' -o sampleInfo.tsv
hail: info: timing:
  read: 2.969s
  annotatesamples expr: 1m20.4s
  annotatesamples expr: 21.868ms
  exportsamples: 151.829ms
  total: 1m23.5s

Roughly five times faster! As for larger scale: running the same command on Google cloud against the VDS representing the full VCF of the 1000 Genomes Project (2,535 whole genomes, about 315GB compressed) took 3m42s using 328 worker cores.

Using the Hail built-ins

Hail also has a sampleqc command that computes most of what you want (and more!):

../hail/build/install/hail/bin/hail read -i profile225.vds \
    sampleqc \
    annotatesamples expr -c \
      'sa.myqc.pHomRef = (sa.qc.nHomRef + sa.qc.nHomVar) / sa.qc.nCalled,sa.myqc.pHet = sa.qc.nHet / sa.qc.nCalled' \
    exportsamples -c 'sample = s,sa.myqc.*,nHom = sa.qc.nHomRef + sa.qc.nHomVar,nHet = sa.qc.nHet,nCalled = sa.qc.nCalled' -o sampleInfo.tsv
hail: info: running: read -i profile225.vds
[Stage 0:>                                                           (0 + 0) / 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[Stage 1:============================================>               (3 + 1) / 4]hail: info: running: sampleqc
[Stage 2:========================================================>(64 + 1) / 65]hail: info: running: annotatesamples expr -c 'sa.myqc.pHomRef = (sa.qc.nHomRef + sa.qc.nHomVar) / sa.qc.nCalled,sa.myqc.pHet = sa.qc.nHet / sa.qc.nCalled'
hail: info: running: exportsamples -c 'sample = s,sa.myqc.*,nHom = sa.qc.nHomRef + sa.qc.nHomVar,nHet = sa.qc.nHet,nCalled = sa.qc.nCalled' -o sampleInfo.tsv
hail: info: timing:
  read: 2.928s
  sampleqc: 1m27.0s
  annotatesamples expr: 229.653ms
  exportsamples: 353.942ms
  total: 1m30.5s

Installing Hail

Installing Hail is quite simple, and we have documentation to help you. Need more help? You can get real-time support in the Hail users chat room or, if you prefer forums, on the Hail Discourse (both are linked from the home page; unfortunately I don't have enough reputation to create real links).

The near future

In the very near future (less than one month from today), the Hail team will have completed a Python API that lets you express the first snippet as:

result = importvcf("YOUR_FILE.vcf.gz")
  .annotatesamples('sa.nCalled = gs.filter(g => g.isCalled).count(),sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),sa.nHet = gs.filter(g => g.isHet).count()')
  .annotatesamples('sa.pHom = sa.nHom / sa.nCalled,sa.pHet = sa.nHet / sa.nCalled')

for (x in result.sampleannotations):
  print("Sample " + x +
        " nCalled " + x.nCalled +
        " nHom " + x.nHom +
        " nHet " + x.nHet +
        " percent Hom " + x.pHom * 100 +
        " percent Het " + x.pHet * 100)

result.sampleannotations.write("sampleInfo.tsv")

EDIT: Added the output of head on the tsv file.

EDIT 2: The latest Hail doesn't need biallelic for sampleqc.

EDIT 3: Added a note about scaling to the cloud with hundreds of cores.
