使用R

2024-05-05 • 问答

我有几个100个基因组区域，起始位置。我想从参考文件中提取每个区域的基因ID（参考文件包含成千上万个基因，其起始和终止位置）。我提取了基因的区域信息，例如，输出文件告诉我一个或两个区域中是否存在某个基因，依此类推。但是，每个区域包含许多基因，并且多个基因位于多个区域。我想要一个输出文件，该文件将所有基因的ID放在该区域旁边的单元格中（每个基因ID以逗号分隔）。如果一个基因存在于多个区域中，它将与该区域的其他基因一起出现在与那些区域相对应的多个细胞中。可以用R代码完成吗？请帮忙。

样本输入。

RegionInfoFile

Region  Chr Start   End
1   1A  1   12345
2   1A  23456   34567
3   2A  1234    23456
***
1830 7D 123     45678

GeneInfoFile

Gene    Chr Start   End
GeneID1 1A  831 1437
GeneID2 1A  1487    2436
GeneID3 1B  2665    5455
***
GeneID10101 7D  13456  56789

RequiredOutPutFile

Region  Chr Start   End   Gene
1       1A  1       12345 GeneID1,GeneID2,GeneID5,GeneID6 ***
***
1830    7D  123     45689 GeneID7,GeneID100,GeneID200 ***

您正在寻找的东西通常是通过Bioconductor上出色的GenomicRanges软件包来完成的。您可以从区域创建GenomicRanges对象，并使用findOverlaps()函数查找它们重叠的位置。要控制什么是重叠，请参见findOverlaps(...,type="")中的“ type”参数。这是R中的一个示例。我敢肯定，这样做的方式更优雅，但这就是我在午餐时想得到的：

# install.packages('readr')
library(readr)
# install.packages('BiocManager')
# BiocManager::install('GenomicRanges')
library(GenomicRanges)

# Read data into R. Needs to have column names "chr","start" and "end"
x <- read_csv('~/Downloads/regions.csv')
y <- read_csv('~/Downloads/genes.csv')

# Set up two GenomicRanges objects
gr_x <- makeGRangesFromDataFrame(x,keep.extra.columns = TRUE)
gr_y <- makeGRangesFromDataFrame(y,keep.extra.columns = TRUE)

# Overlap the regions containing genes with the trait data
ovl <- findOverlaps(gr_y,gr_x,type = "any",select = "all",ignore.strand = TRUE)
# Group hits by trait regions
hits_by_region <- split(from(ovl),to(ovl))

# Create a data.frame that matches a trait region index to a string of genes
hits_df <- do.call(rbind,lapply(seq_along(hits_by_region),function(f) {
  idx <- as.integer(names(hits_by_region[f]))
  genes <- gr_y$Gene[hits_by_region[[f]]]
  data.frame(Index = idx,Genes = paste(genes,collapse=','),stringsAsFactors = FALSE)
}))

# Add the string of genes as metadata to the GRanges object
gr_x$Genes <- ""
gr_x$Genes[hits_df$Index] <- hits_df$Genes

# Convert back to data.frame (if needed)
genes_by_regio_df <- as.data.frame(gr_x)

使用R

样本输入。

RegionInfoFile

GeneInfoFile

RequiredOutPutFile

tskpzx 回答：使用R

大家都在问