How to monitor progress of an apply function?
Solution 1
You can use the pbapply package, which adds a progress bar and a time estimate to any function in the '*apply' family.
In the case of your question:
library(pbapply)
library(data.table)
result <- data.table::rbindlist( pbapply(COMB, 1, append) )
P.S. This answer addresses your first two points. Regarding the third, I'm not sure whether it's possible to pause the function. In any case, your operation is indeed taking too long, so I would recommend posting a separate question asking how to optimize the task.
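On the third point (pause and continue later), one possible workaround is to split the work into chunks and save each finished chunk to disk, so that an interrupted run can pick up where it stopped. This is only a sketch and not a feature of pbapply; the helper `run_in_chunks`, the file naming scheme, and the checkpoint directory are all hypothetical names chosen for illustration.

```r
# Process a long job in chunks and save each finished chunk to disk,
# so an interrupted run can be resumed by calling the function again.
# `run_in_chunks` is a hypothetical helper, not part of any package.
run_in_chunks <- function(x, fun, chunk_size, dir) {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  chunks <- split(x, ceiling(seq_along(x) / chunk_size))
  for (k in seq_along(chunks)) {
    f <- file.path(dir, sprintf("chunk_%05d.rds", k))
    if (file.exists(f)) next   # already finished in an earlier run
    saveRDS(lapply(chunks[[k]], fun), f)
    message(sprintf("chunk %d/%d done", k, length(chunks)))
  }
  # Read all chunks back in order and flatten them into one list.
  files <- file.path(dir, sprintf("chunk_%05d.rds", seq_along(chunks)))
  unlist(lapply(files, readRDS), recursive = FALSE)
}

ckpt <- file.path(tempdir(), "apply_checkpoints")  # hypothetical location
out <- run_in_chunks(1:10, function(i) i^2, chunk_size = 3, dir = ckpt)
```

Interrupting the loop and calling `run_in_chunks` again with the same `dir` skips the chunks that already exist on disk. A chunk file written only partially during an interruption would need to be deleted by hand before resuming.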
Solution 2
You can use txtProgressBar from the utils package:
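total <- 50
pb <- txtProgressBar(min = 0, max = total, style = 3)
lapply(1:total, function(i){
  Sys.sleep(0.1)            # stand-in for the real work
  setTxtProgressBar(pb, i)  # advance the bar to step i
})
close(pb)                   # finish the bar and move to a new line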
Or use the *ply family from the plyr package:
library(plyr)
laply(1:100, function(i) {Sys.sleep(0.05); i}, .progress = "text")
See ?create_progress_bar for more details.
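txtProgressBar itself only draws the bar. If a rough time estimate is also wanted without extra packages, one option is to extrapolate from the time elapsed so far. The sketch below assumes each iteration takes roughly the same time; the variable `eta` and the `Sys.sleep` stand-in are illustrative only.

```r
total <- 20
pb <- txtProgressBar(min = 0, max = total, style = 3)
start <- Sys.time()
eta <- numeric(total)  # estimated seconds remaining after each step

res <- lapply(seq_len(total), function(i) {
  Sys.sleep(0.01)  # stand-in for the real work
  elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
  # Linear extrapolation: average time per step times steps left.
  eta[i] <<- elapsed / i * (total - i)
  setTxtProgressBar(pb, i)
  i^2
})
close(pb)
```

After a few iterations, `eta[i]` gives a usable guess at the remaining seconds. It will be off if the cost per iteration varies a lot.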
Solution 3
Try this instead:
setkey(SOURCE, NAME)
# Name the CJ columns explicitly so that V1/V2 exist regardless of data.table version
SOURCE[, CJ(V1 = NAME, V2 = NAME, unique = TRUE)][
  , mean(SOURCE[V1, VALUE] == SOURCE[V2, VALUE]), by = .(V1, V2)]
Fwiw, the all-caps names are an awful choice imo - makes writing and reading code significantly harder.
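For comparison, the same agreement fraction (what the question computes as mean(VALUE1 == VALUE2) per pair of names) can be sketched in base R with matrix products instead of a per-pair loop. This is a different technique from the data.table code above, and it assumes every name has the same number of observations in a consistent order. The toy sizes and the names `src`, `m` and `agreement` are made up for illustration.

```r
# Base-R sketch: pairwise agreement between names via matrix products.
# Assumes every name has the same number of observations.
set.seed(1)
n_names <- 5  # hypothetical small size; the question has thousands of names
n_obs   <- 6
src <- data.frame(
  NAME  = rep(paste0("NAME", seq_len(n_names)), each = n_obs),
  VALUE = sample(c(TRUE, FALSE), n_names * n_obs, replace = TRUE),
  stringsAsFactors = FALSE
)

# One column per name, one row per observation, coded as 0/1.
nm <- unique(src$NAME)
m <- sapply(nm, function(x) as.numeric(src$VALUE[src$NAME == x]))

# Matches = positions where both are TRUE plus positions where both are FALSE.
agreement <- (crossprod(m) + crossprod(1 - m)) / n_obs
```

`agreement["NAME1", "NAME2"]` then holds the fraction of positions at which the two names have equal values, and the full matrix comes from two matrix multiplications rather than one lookup per pair.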
ChiseledAbs
Updated on June 06, 2022
I need to work out a 2886*2886 correlation matrix. The problem is that building an intermediary data.table (RESULT) takes a long time to be bound together, so I would like to be able to do the following while calling the last line, RESULT=rbindlist(apply(COMB, 1, append)), in the code below:
- Estimate the time it will take for the apply function to complete
- Monitor its progress
- Be able to pause and continue at a later time
Here is the code:
SOURCE=data.table(NAME=rep(paste0("NAME", as.character(1:2889)), each=600), VALUE=sample(c(TRUE,FALSE), 600, TRUE) )
> SOURCE
            NAME VALUE
      1:   NAME1  TRUE
      2:   NAME1  TRUE
      3:   NAME1  TRUE
      4:   NAME1  TRUE
      5:   NAME1  TRUE
     ---
1733396: NAME999  TRUE
1733397: NAME999  TRUE
1733398: NAME999  TRUE
1733399: NAME999  TRUE
1733400: NAME999 FALSE
setkey(SOURCE,NAME)
a=SOURCE[,unique(NAME)]
COMB=data.table(expand.grid(a,a, stringsAsFactors=FALSE))
> COMB
             Var1    Var2
      1:    NAME1   NAME1
      2:   NAME10   NAME1
      3:  NAME100   NAME1
      4: NAME1000   NAME1
      5: NAME1001   NAME1
     ---
8346317:  NAME995 NAME999
8346318:  NAME996 NAME999
8346319:  NAME997 NAME999
8346320:  NAME998 NAME999
8346321:  NAME999 NAME999
append <- function(X) {
  data.table(NAME1=X[1], VALUE1=SOURCE[X[1], VALUE], NAME2=X[2], VALUE2=SOURCE[X[2], VALUE] )
}
RESULT=rbindlist(apply(COMB, 1, append))
Any ideas?
Also, do you know if there is a faster way to generate the data.table RESULT from SOURCE? RESULT is an intermediary data.table used to work out the correlation values between VALUE1 and VALUE2 for each pair of NAME values.
With a subset of SOURCE, RESULT looks like this:
SOURCE=SOURCE[sample(1:nrow(SOURCE), 3)]
setkey(SOURCE,NAME)
a=SOURCE[,unique(NAME)]
COMB=data.table(expand.grid(a,a, stringsAsFactors=FALSE))
RESULT=rbindlist(apply(COMB, 1, append))
> RESULT
      NAME1 VALUE1    NAME2 VALUE2
1: NAME1859   TRUE NAME1859   TRUE
2:  NAME768  FALSE NAME1859   TRUE
3:  NAME795   TRUE NAME1859   TRUE
4: NAME1859   TRUE  NAME768  FALSE
5:  NAME768  FALSE  NAME768  FALSE
6:  NAME795   TRUE  NAME768  FALSE
7: NAME1859   TRUE  NAME795   TRUE
8:  NAME768  FALSE  NAME795   TRUE
9:  NAME795   TRUE  NAME795   TRUE
Later on I'm going to do
RESULT[,VALUE3:=(VALUE1==VALUE2)]
to finally get the correlation values:
RESULT[, mean(VALUE3), by=c("NAME1", "NAME2")]
So maybe the whole process can be done more efficiently, who knows.