Optimizing this for loop to avoid crashing the program

I'm struggling to run an operation in R without it crashing. Here is a reproducible example. I have X:

# V1 is filled to 5 rows; the fifth values of B and C are not given, so they are NA
X <- data.frame(V1 = rep("chr1", 5), Start = c(0,1001,3002,4059,6581), Stop = c(1000,3001,4058,6580,7002), A = c(10,4,5,6,9), B = c(923,39,9,93,NA), C = c(239,2,13,5,NA))

I want to perform this operation:

for (row in 1:nrow(X)){
  X$A <- (X$A / (X[row,"Stop"] - X[row,"Start"])) * mean(X$Stop - X$Start)
  X$B <- (X$B / (X[row,"Start"])) * mean(X$Stop - X$Start)
  X$C <- (X$C / (X[row,"Start"])) * mean(X$Stop - X$Start)
}

My problem shows up when the file is much larger (for example, 2,000,000 rows). Is there a way to do this faster on such a large data.frame?

zhuangmisi answered: Optimizing this for loop to avoid crashing the program

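Your loop is slow because on every row it overwrites the entire A, B and C vectors. In other words, for each row you write 6,000,000 values, and you do that 2,000,000 times.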
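To make the point above concrete, here is a minimal sketch on a small hypothetical data frame (`toy`, with invented values; it is not the asker's data). Since each pass rescales the whole column while `Start` and `Stop` never change, the loop appears to multiply every element of `A` by the product of all per-row factors, which is probably not the row-wise normalisation that was intended.

```r
# Hypothetical toy data (values invented for illustration only)
toy <- data.frame(Start = c(0, 10, 30), Stop = c(10, 30, 60), A = c(1, 2, 3))

len <- toy$Stop - toy$Start      # per-row interval length
avg_len <- mean(len)             # average interval length

# The loop from the question, applied to column A only
looped <- toy
for (row in seq_len(nrow(looped))) {
  looped$A <- (looped$A / (looped[row, "Stop"] - looped[row, "Start"])) * avg_len
}

# Every pass touched all of A, so the net effect is one scalar factor
compounded <- toy$A * prod(avg_len / len)

# Row-wise normalisation: each row divided by its own interval length
rowwise <- toy$A / len * avg_len

print(all.equal(looped$A, compounded))
print(rowwise)
```

With millions of rows that compounded factor can easily overflow or underflow, so a single vectorised expression is both faster and closer to what the formula seems to mean.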

I'm using dplyr here:

library(dplyr)

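# Divide each row by its own interval length, then scale by the mean interval length
X <- X %>%
  mutate(A = A / (Stop - Start) * mean(Stop - Start),
         B = B / (Stop - Start) * mean(Stop - Start),
         C = C / (Stop - Start) * mean(Stop - Start))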

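I strongly recommend that you do not overwrite the current A, B and C values, and write the results to new columns instead: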

library(dplyr)

# Same calculation, but the original A, B and C columns are kept
X <- X %>%
  mutate(TRANSFORMED_A = A / (Stop - Start) * mean(Stop - Start),
         TRANSFORMED_B = B / (Stop - Start) * mean(Stop - Start),
         TRANSFORMED_C = C / (Stop - Start) * mean(Stop - Start))
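If the same formula really applies to all three columns, the repetition can be avoided with `across()` (available in dplyr 1.0.0 and later). The sketch below runs on a hypothetical `toy` data frame with invented values, and assumes the intended formula is column / (Stop - Start) * mean(Stop - Start); note that the loop in the question divides B and C by `Start` instead, so adjust the expression if that is what you need.

```r
library(dplyr)  # across() requires dplyr >= 1.0.0

# Hypothetical toy data (values invented for illustration only)
toy <- data.frame(
  Start = c(0, 10, 30),
  Stop  = c(10, 30, 60),
  A = c(1, 2, 3), B = c(4, 5, 6), C = c(7, 8, 9)
)

# Apply one formula to A, B and C and store the results in new columns
out <- toy %>%
  mutate(across(
    c(A, B, C),
    ~ .x / (Stop - Start) * mean(Stop - Start),
    .names = "TRANSFORMED_{.col}"
  ))

print(out)
```

The `.names` argument keeps the original columns untouched, in line with the advice above.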

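If you do not actually intend to overwrite the whole column with updated values on every iteration, you may also want to consider these solutions. data.table handles large datasets well.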

library(data.table)       # load the library
setDT(X)                  # convert the data.frame to a data.table by reference
cols <- c("A", "B", "C")  # columns to update
avg.diff <- X[, mean(Stop - Start)]   # average difference between Stop and Start
X[, (cols) := lapply(.SD, function(x) avg.diff * x / (Stop - Start)), .SDcols = cols]  # update the columns in place

The same, but easier to read:

X[, `:=`(
  A = avg.diff * A / (Stop - Start),
  B = avg.diff * B / (Stop - Start),
  C = avg.diff * C / (Stop - Start)
)]
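For comparison, the same row-wise calculation can be sketched in base R without any packages. This is a swapped-in alternative rather than part of the answer above; `toy`, `len` and `scale_factor` are hypothetical names, and the values are invented for illustration.

```r
# Hypothetical toy data (values invented for illustration only)
toy <- data.frame(
  Start = c(0, 10, 30),
  Stop  = c(10, 30, 60),
  A = c(1, 2, 3), B = c(4, 5, 6), C = c(7, 8, 9)
)

cols <- c("A", "B", "C")          # columns to update
len <- toy$Stop - toy$Start       # per-row interval length
scale_factor <- mean(len) / len   # one scaling factor per row

# Multiply each selected column by the per-row factor in one vectorised step
toy[cols] <- lapply(toy[cols], function(x) x * scale_factor)

print(toy)
```

On very large tables the data.table version is likely to use less memory because `:=` updates columns by reference, whereas this base R form creates new column vectors.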

Feel free to adapt these solutions to your needs.
