Reading global variables using foreach in R

The doParallel package auto-exports to the workers any variables that are referenced in the foreach loop. If you don't want that, you can use the foreach ".noexport" option to prevent particular variables from being auto-exported. But if I understand you correctly, your problem is that R is subsequently duplicating some of those variables, which is even more of a problem than usual since it is happening in multiple processes on a single machine.
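
As a minimal sketch of the ".noexport" option (the object name big and the two-worker cluster are made up for illustration), the loop below names big in its body, which would normally get it auto-exported, but asks foreach not to ship it:

```r
library(doParallel)

cl <- makeCluster(2)             # illustrative worker count
registerDoParallel(cl)

big <- matrix(0, 1000, 1000)     # hypothetical large object

seen <- foreach(i = 1:2, .combine = 'c', .noexport = 'big') %dopar% {
    # 'big' appears as a symbol here, so without .noexport it would be exported
    if (exists('big')) nrow(big) else NA_integer_
}

stopCluster(cl)
print(seen)
```

On a freshly started cluster the workers should have no object called big, so the exists() check is expected to fail there. The option only suppresses the export; it does nothing about copies that R makes inside a worker.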

There isn't a way to declare a variable so that R will never make a duplicate of it. You either need to replace the problem variables with objects from a package like bigmemory, so that copies are never made, or try modifying the code so that it doesn't trigger the duplication. The tracemem function can help with the latter, since it prints a message whenever the traced object is duplicated.
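
A small sketch of how tracemem can be used to spot a duplication (the vector is just a stand-in for one of your large objects, and this assumes an R build with memory profiling enabled, which I believe is the case for the standard binaries):

```r
x <- numeric(1e6)    # stand-in for a large read-only object
tracemem(x)          # start reporting duplications of this object

y <- x               # no copy yet: y and x share the same memory
s <- sum(y)          # reading does not force a copy
y[1] <- 1            # writing forces a copy; tracemem prints a message here

untracemem(x)        # stop tracing
```

Running the body of your foreach loop sequentially with the suspect variables traced like this should show which statements trigger the copies.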

However, you may be able to avoid the problem by reducing the data that is needed by the workers. That reduces the amount of data that needs to be copied to each of the workers, as well as decreasing their memory footprint.

Here is a classic example of giving the workers more data than they need:

x <- matrix(1:100, 10)
foreach(i=1:10, .combine='c') %dopar% {
    mean(x[,i])
}

Since the matrix x is referenced in the foreach loop, it will be auto-exported to each of the workers, even though each worker only needs a subset of the columns. The simplest solution is to iterate over the actual columns of the matrix rather than over column indices:

foreach(xc=x, .combine='c') %dopar% {
    mean(xc)
}

Not only is less data transferred to the workers, but each of the workers only actually needs to have one column in memory at a time, which greatly decreases its memory footprint for large matrices. The xc vector may still end up being duplicated, but it doesn't hurt nearly as much because it is much smaller than x.
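
For anyone who wants to try both versions, here is a self-contained sketch (the two-worker cluster is an arbitrary choice) that runs each loop and compares the results with the serial answer:

```r
library(doParallel)

cl <- makeCluster(2)           # arbitrary worker count for illustration
registerDoParallel(cl)

x <- matrix(1:100, 10)

# Index-based loop: the body names x, so the whole matrix is exported
by_index <- foreach(i = 1:10, .combine = 'c') %dopar% {
    mean(x[, i])
}

# Column-based loop: foreach iterates over the columns of x,
# so each task only receives the column it works on
by_column <- foreach(xc = x, .combine = 'c') %dopar% {
    mean(xc)
}

stopCluster(cl)

# Both loops should agree with colMeans
print(all.equal(by_index, colMeans(x)))
print(all.equal(by_column, colMeans(x)))
```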

Note that this technique only helps when doParallel uses the "snow-derived" functions, such as parLapply and clusterApplyLB, not when it uses mclapply. With mclapply, the workers are forked from the master process and so get the matrix x for free; transferring the columns to them individually is then extra work that can make the loop a bit slower. However, on Windows, doParallel can't use mclapply, so this technique is very important there.
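
To make the distinction concrete, here is a rough sketch of the two ways of registering doParallel (n_workers is an illustrative value). It picks a cluster on Windows and forked workers elsewhere, then reports which backend ended up registered:

```r
library(doParallel)

n_workers <- 2   # illustrative value

if (.Platform$OS.type == "windows") {
    # No forking on Windows: use a cluster of separate R processes
    cl <- makeCluster(n_workers)
    registerDoParallel(cl)
} else {
    # On Unix-like systems, registering with cores= uses forked workers
    registerDoParallel(cores = n_workers)
}

backend <- getDoParName()      # name of the registered backend
workers <- getDoParWorkers()   # number of workers it will use
print(backend)
print(workers)

if (.Platform$OS.type == "windows") stopCluster(cl)
```

Passing a cluster object to registerDoParallel on Linux or Mac OS X selects the snow-derived functions there as well, which is a convenient way to reproduce the Windows behaviour on another machine.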

The important thing is to think about what data is really needed by the workers in order to perform their work and to try to decrease it if possible. Sometimes you can do that by using special iterators, either from the iterators or itertools packages, but you may also be able to do that by changing your algorithm.
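
As one example of a special iterator, the iter function from the iterators package can hand out the rows of a matrix instead of the columns. This is only a sketch with a toy matrix and an arbitrary two-worker cluster:

```r
library(doParallel)
library(iterators)

cl <- makeCluster(2)           # arbitrary worker count for illustration
registerDoParallel(cl)

x <- matrix(1:100, 10)

# Each task receives a single row of x rather than the whole matrix
row_means <- foreach(xr = iter(x, by = 'row'), .combine = 'c') %dopar% {
    mean(xr)
}

stopCluster(cl)

print(all.equal(row_means, rowMeans(x)))
```

The itertools package has iterators that split a matrix into larger blocks, which may be a better fit when the work per row or column is tiny.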

Author by

user2272413

Updated on June 16, 2022

Comments

  • user2272413
    user2272413 almost 2 years

    I am trying to run a foreach loop with the doParallel package on a Windows server with a 16-core CPU and 64 GB of RAM, using RStudio.

    The "worker" processes copy over all the variables from outside the foreach loop (observed by watching these processes start up in Windows Task Manager when the foreach loop is run), which bloats the memory used by each process. I tried declaring some of the especially large variables as global, while ensuring that these variables were only read from, and never written to, inside the foreach loop to avoid conflicts. However, the processes still quickly use up all available memory.

    Is there a mechanism to ensure that the "worker" processes do not create copies of some of the "read-only" variables, for example a specific way to declare such variables?