R / data.table: optimizing a "recursive" group by

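I am working with large data.tables (1e6-10e6 rows, 10 columns) holding genomic data. I want to reduce the data by collapsing each group to a single row. The reduction depends on several columns, but has to be applied consecutively. Sample data: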

library(data.table)
dt.tmp <- data.table(str1=paste0("A",sample(1:100,2000,replace=TRUE)),str2=paste0("B",sample(1:5,2000,replace=TRUE)),c1=sample(1:3,2000,replace=TRUE),c2=sample(1:3,2000,replace=TRUE),d1=sample(1:2,2000,replace=TRUE),d2=sample(1:2,2000,replace=TRUE))

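For this data, I want to reduce on column str1 using the following steps:

Within the group defined by str1, create groups based on str2 and select the largest group(s)
In the resulting group(s), select the group(s) with max (c1+c2)
In the resulting group(s), select the group(s) with max (d1+d2)
In the resulting group(s), select a random row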
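To make the intended reduction concrete, here is a minimal sketch on a hand-made toy table. The table `toy` and its values are hypothetical, chosen only so that each step visibly removes rows; it is an illustration of the four steps, not an optimized solution.

```r
library(data.table)

# Hypothetical toy data: a single str1 group with three str2 sub-groups.
toy <- data.table(
  str1 = "A1",
  str2 = c("B1", "B1", "B2", "B2", "B3"),
  c1   = c(3L, 1L, 3L, 2L, 3L),
  c2   = c(3L, 1L, 3L, 2L, 3L),
  d1   = c(1L, 1L, 2L, 1L, 2L),
  d2   = c(2L, 1L, 2L, 1L, 2L)
)

# Step 1: size of each (str1, str2) group; keep the largest group(s) per str1.
toy[, n := .N, by = .(str1, str2)]
step1 <- toy[, .SD[n == max(n)], by = str1]                     # B1 and B2 tie, B3 is dropped

# Step 2: among the survivors, keep rows with max (c1 + c2).
step2 <- step1[, .SD[c1 + c2 == max(c1 + c2)], by = str1]       # one B1 row and one B2 row

# Step 3: among the survivors, keep rows with max (d1 + d2).
step3 <- step2[, .SD[d1 + d2 == max(d1 + d2)], by = str1]       # only the B2 row is left

# Step 4: pick one random row per str1 (trivial here, one row remains).
step4 <- step3[, .SD[sample(.N, size = 1)], by = str1]
print(step4)
```

Note that the B3 row has the same (c1+c2) and (d1+d2) as the winning row but is removed in step 1, which is why the steps have to be applied in sequence rather than all at once.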

I have tried various combinations of operations on .SD, for example:

dt.tmp[,':='(c=c1+c2,d=d1+d2,rnd=sample.int(.N))
    ][,':='(n=.N),by=.(str1,str2)
    ][,.SD[n==max(n),.SD[c==max(c),.SD[d==max(d),.SD[rnd==max(rnd)],by=d],by=c],by=n],by=str1];

My latest attempt tries to minimize the use of .SD:

dt.tmp[,':='(c=c1+c2,d=d1+d2,rnd=sample.int(.N))
     ][,':='(n=.N,cmaxidx=(c==max(c))),by=.(str1,str2)
     ][,':='(nmaxidx=(n==max(n))),by=str1
     ][,':='(dmaxidx=(d==max(d))),by=.(str1,str2,c)
     ][,.SD[dmaxidx&cmaxidx&nmaxidx
     ][rnd==max(rnd)],by=str1
     ][,':='(c=NULL,d=NULL,nmaxidx=NULL,cmaxidx=NULL,dmaxidx=NULL,n=NULL,rnd=NULL)][,.SD]

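(The trailing operations are just cleanup and printing.)
I am by no means deep "into" data.table. Are there any obvious optimizations to the above problem/code that would reduce execution time? Currently I need 200-300 CPU hours, which comes down to about 14 wall-clock hours on a server using up to 24 cores.
The real data looks like this: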

Classes 'data.table' and 'data.frame':  50259993 obs. of  26 variables:
 $ BC         : chr  "AAAAAAAAAAAACAAGGTCG" "AAAAAAAAAAAACTACCGTG" "AAAAAAAAAAAAGCACTGAG" "AAAAAAAAAAAAGCACTGAG" ...
 $ chrom      : chr  "chr2L" "chr2R" "chr2R" "chr2R" ...
 $ start      : int  22371281 12477441 8323580 8323580 17304870 31837917 24897443 22469324 22469324 18294732 ...
 $ end        : int  22371463 12477734 8323924 8323924 17305040 31838183 24897665 22469723 22469723 18295044 ...
 $ strand     : chr  "+" "+" "-" "-" ...
 $ MAPQ1      : int  1 40 42 42 42 42 24 1 1 42 ...
 $ MAPQ2      : int  1 40 42 42 42 42 24 1 1 42 ...
 $ AS1        : int  -3 -33 0 -3 -12 -6 -39 0 0 0 ...
 $ AS2        : int  -12 -3 -18 -15 0 0 -3 -5 -20 -6 ...
 $ XS1        : num  -3 NA NA NA NA NA NA 0 0 NA ...
 $ XS2        : num  -12 NA NA NA NA NA NA 0 -15 NA ...
 $ SNP_ABS_POS: chr  "22371329,22371329,22371356,22371437" "12477460,12477500,12477524,12477707,12477719" "8323582,8323583,8323588,8323750,8323759,8323791,8323868,8323878" "8323582,8323878" ...
 $ SNP_REL_POS: chr  "48,48,75,156" "19,59,83,266,278" "2,3,8,170,179,211,288,298" "2,298" ...
 $ SNP_ID     : chr  ".,.,." ".,." ...
 $ SNP_SEQ    : chr  "CCCTTCATCGCACGAATGTGTGCGT,CCCTTCATCGCACGAATGTGAGCGT,A,T" "T,G,ACCGGCATCCATCCATCCAT,T,C" "T,ACG,C,T" ...
 $ SNP_VAR    : chr  "-3,-3,0" "0,-1,-2,0" "1,1,-1" "1,-1" ...
 $ SNP_PARENT : chr  "unexpected,unexpected,expected,expected" "expected,non_parental_allele,unread,non_parental_allele" "expected,non_parental_allele" ...
 $ SNP_TYPE   : chr  "indel,indel,snp,snp" "snp,snp" ...
 $ SNP_SUBTYPE: chr  "del,del,ts,tv" "tv,tv,ts" "tv,ins,tv" ...
 - attr(*,".internal.selfref")=<externalptr> 
 - attr(*,"sorted")= chr  "BC" "chrom" "start" "end"

Here BC = str1, chrom + start + end = str2, MAPQ1/2 = c1/2, and AS1/2 = d1/2. This data is reduced to about 20e6 rows.

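The input data is sorted by chrom, start, end. Is there an advantageous way to exploit a particular ordering?
Am I right in thinking that using .SD requires more memory (although memory is not really the problem atm) and is therefore not optimal?

Any help and pointers would be greatly appreciated.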

SessionInfo:

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.2

loaded via a namespace (and not attached):
[1] compiler_3.6.1    R.methodsS3_1.7.1 R.utils_2.8.0     R.oo_1.22.0

Breaking it down into steps:

# Within group defined by str1 create groups based on str2 and select the largest group(s)
combinations2keep <- dt.tmp[,.N,by = .(str1,str2)
                            ][,.SD[N == max(N)],by = str1
                              ][,!"N"]
dt.tmp <- dt.tmp[combinations2keep,on = .(str1,str2)]

# In resulting group(s) select group(s) with max (c1+c2)
dt.tmp <- dt.tmp[,.SD[c1+c2 == max(c1+c2)],by = str1]

# In resulting group(s) select group(s) with max (d1+d2)
dt.tmp <- dt.tmp[,.SD[d1+d2 == max(d1+d2)],by = str1]

# In resulting group(s) select a random row
dt.tmp <- dt.tmp[,.SD[sample(.N,size = 1)],by = str1]
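As a possible variation on the steps above (an assumption on my side, not something benchmarked on the real data), each `.SD[...]` subset can be replaced by collecting row numbers with `.I` and subsetting once. This idiom is often reported to be faster than subsetting `.SD` per group, so it may be worth timing. The table `dt` and the helper columns `n`, `csum` and `dsum` are names introduced only for this sketch.

```r
library(data.table)

set.seed(1)  # only to make the sketch reproducible
dt <- data.table(
  str1 = paste0("A", sample(1:100, 2000, replace = TRUE)),
  str2 = paste0("B", sample(1:5, 2000, replace = TRUE)),
  c1 = sample(1:3, 2000, replace = TRUE),
  c2 = sample(1:3, 2000, replace = TRUE),
  d1 = sample(1:2, 2000, replace = TRUE),
  d2 = sample(1:2, 2000, replace = TRUE)
)

# Precompute the group size and the two sums once.
dt[, n := .N, by = .(str1, str2)]
dt[, `:=`(csum = c1 + c2, dsum = d1 + d2)]

# Each step returns the row numbers to keep (column V1) and subsets by them.
res <- dt[dt[, .I[n == max(n)], by = str1]$V1]
res <- res[res[, .I[csum == max(csum)], by = str1]$V1]
res <- res[res[, .I[dsum == max(dsum)], by = str1]$V1]
res <- res[res[, .I[sample.int(.N, 1L)], by = str1]$V1]

# Drop the helper columns again.
res[, c("n", "csum", "dsum") := NULL]
```

`sample.int(.N, 1L)` is used instead of `sample(.I, 1)` to avoid the surprising behaviour of `sample()` when it is handed a single number.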

The four steps compressed into a single chain:

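dt.tmp[dt.tmp[,.N,by = .(str1,str2)][,.SD[N == max(N)],by = str1],on = .(str1,str2)
       ][,.SD[c1+c2 == max(c1+c2)],by = str1
         ][,.SD[d1+d2 == max(d1+d2)],by = str1
           ][,.SD[sample(.N,size = 1)],by = str1
             ][,!"N"]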
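On the question of whether a particular ordering can be exploited: one alternative that is not part of the chain above, sketched here under my own assumptions, is to sort once and take the first row per str1. Because the filters are applied in a fixed priority (group size, then c1+c2, then d1+d2, then random), sorting by those criteria in descending order with a random tie-breaker should pick a row from the same final candidate set. The names `dt`, `n`, `csum`, `dsum` and `rnd` are illustrative only.

```r
library(data.table)

set.seed(1)  # only to make the sketch reproducible
dt <- data.table(
  str1 = paste0("A", sample(1:100, 2000, replace = TRUE)),
  str2 = paste0("B", sample(1:5, 2000, replace = TRUE)),
  c1 = sample(1:3, 2000, replace = TRUE),
  c2 = sample(1:3, 2000, replace = TRUE),
  d1 = sample(1:2, 2000, replace = TRUE),
  d2 = sample(1:2, 2000, replace = TRUE)
)

# One grouped pass for the group size, one ungrouped pass for the rest.
dt[, n := .N, by = .(str1, str2)]
dt[, `:=`(csum = c1 + c2, dsum = d1 + d2, rnd = sample.int(.N))]

# Sort by priority: largest group, then max csum, then max dsum, then random.
setorder(dt, str1, -n, -csum, -dsum, rnd)

# unique() keeps the first row per str1, i.e. the best row under that order.
res <- unique(dt, by = "str1")
res[, c("n", "csum", "dsum", "rnd") := NULL]
```

Note that `setorder()` reorders `dt` by reference, so the original row order is lost unless it is stored in a column first. Whether this beats the grouped filters on 50e6 rows would need to be measured.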

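@sindri_baldur: I made a further optimization to your answer. In about half of the cases, the first grouping yields groups with only one row. By splitting the result of the first grouping into single-row groups and the rest, half of the data no longer has to go through the remaining groupings. This saves 10-20% of the computation time.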

dt.tmp.N <- dt.tmp[,.N,by = .(BC,chrom,start,end)
                   ][,.SD[N == max(N)],by = BC]
dt.tmp.1 <- dt.tmp[dt.tmp.N[N==1],on = .(BC,chrom,start,end)][,.SD[sample(.N,size = 1)],by = BC][,!"N"]
dt.tmp.Ng1 <- dt.tmp[dt.tmp.N[N>1],on = .(BC,chrom,start,end)
                     ][,.SD[MAPQ1+MAPQ2 == max(MAPQ1+MAPQ2)],by = BC
                       ][,.SD[AS1+AS2 == max(AS1+AS2)],by = BC
                         ][,.SD[sample(.N,size = 1)],by = BC
                           ][,!"N"]
rbindlist(list(dt.tmp.1,dt.tmp.Ng1))
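For anyone who wants to try the split without the real data, here is the same idea written against the sample column names (str1, str2, c1, c2, d1, d2). This is only a sketch: the sample table is deliberately made sparser than the one in the question (500 str1 values, 50 str2 values, an assumption of mine) so that single-row groups actually occur, and the names `grp`, `single` and `multi` are made up for the example.

```r
library(data.table)

set.seed(1)  # only to make the sketch reproducible
dt <- data.table(
  str1 = paste0("A", sample(1:500, 2000, replace = TRUE)),
  str2 = paste0("B", sample(1:50, 2000, replace = TRUE)),
  c1 = sample(1:3, 2000, replace = TRUE),
  c2 = sample(1:3, 2000, replace = TRUE),
  d1 = sample(1:2, 2000, replace = TRUE),
  d2 = sample(1:2, 2000, replace = TRUE)
)

# Largest (str1, str2) group(s) per str1, with their size N.
grp <- dt[, .N, by = .(str1, str2)][, .SD[N == max(N)], by = str1]

# str1 values whose largest group has one row: only the random pick is needed.
single <- dt[grp[N == 1], on = .(str1, str2)
             ][, .SD[sample(.N, size = 1)], by = str1
             ][, !"N"]

# Everything else goes through the full sequence of filters.
multi <- dt[grp[N > 1], on = .(str1, str2)
            ][, .SD[c1 + c2 == max(c1 + c2)], by = str1
            ][, .SD[d1 + d2 == max(d1 + d2)], by = str1
            ][, .SD[sample(.N, size = 1)], by = str1
            ][, !"N"]

res <- rbindlist(list(single, multi))
```

One consequence of this split is that rows in `single` skip the (c1+c2) and (d1+d2) filters entirely, so when several one-row groups tie within a str1, the row is picked purely at random.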

(P.S. I tried to post this as a comment, but it was too long.)
