doParallel "foreach" inconsistently inherits objects from parent environment: "Error in { : task 1 failed - "could not find function..."

Solution 1

@Tensibai is right. When using doParallel on Windows, you have to "export" any functions you want to use that are not in the current scope. I've made this work with the following (redacted) example.

format_number <- function(data) {
  # do stuff that requires stringr
}

format_date_time <- function(data) {
  # do stuff that requires stringr
}

add_direction_data <- function(data) {
  # do stuff that requires dplyr
}

parse_data <- function(data) {
  voice_start <- # vector of values
  voice_end <- # vector of values
  target_phone_numbers <- # vector of values
  parse_voice_block <- function(block_start, block_end, number) {
    # do stuff
  }

  number_of_cores <- parallel::detectCores() - 1
  clusters <- parallel::makeCluster(number_of_cores)
  doParallel::registerDoParallel(clusters)
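  # foreach must be attached (library(foreach)) so that foreach() and %dopar% are visible.
  # .combine is left at its default, which returns a flat list; .combine = list
  # nests the results once there are more than 100 tasks.
  data_list <- foreach(i = seq_along(voice_start),
                       .export = c("format_number", "format_date_time", "add_direction_data"),
                       .packages = c("dplyr", "stringr")) %dopar%
                       parse_voice_block(voice_start[i], voice_end[i], target_phone_numbers[i])
  # stopCluster() lives in the parallel package, not in doParallel
  parallel::stopCluster(clusters)
  output <- plyr::rbind.fill(data_list)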
}

Since the first three functions aren't defined in the environment where foreach is called, doParallel would not send them to the new instances of R, but it does know where to find parse_voice_block since it's within the current scope. In addition, you need to specify which packages should be loaded in each new instance of R. As Tensibai stated, this is because you're not forking the process, but instead firing up multiple instances of R and running commands simultaneously.

Solution 2

It's rather unfortunate that when you register doParallel using:

registerDoParallel(2)

then doParallel uses mclapply on Linux and Mac OS X, but clusterApplyLB with an implicitly created cluster object on Windows. This often causes code to work on Linux but fail on Windows: with mclapply the workers are forked clones of the master, so they already have all of its objects and loaded packages, whereas the cluster workers on Windows start as fresh R sessions. For that reason, I usually test my code using:

cl <- makePSOCKcluster(2)
registerDoParallel(cl)

to make sure I'm loading all necessary packages and exporting all necessary functions and variables, and then switch back to registerDoParallel(2) to get the benefit of mclapply on platforms that support it.

Note that the .packages and .export options are ignored when doParallel uses mclapply, but I recommend always using them for portability.
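A minimal sketch of that testing workflow, assuming the foreach and doParallel packages are installed. The helper add_ten and the wrapper run_tasks are hypothetical names used only for illustration:

```r
library(doParallel)  # also attaches foreach, iterators and parallel

# Hypothetical helper defined in the global environment.
add_ten <- function(x) x + 10

run_tasks <- function() {
  # foreach is called inside a function, so add_ten is not defined in the
  # calling environment; export it explicitly for the PSOCK workers.
  foreach(x = 1:4, .export = "add_ten") %dopar% add_ten(x)
}

# Test with an explicit PSOCK cluster, which behaves the same on every platform.
cl <- makePSOCKcluster(2)
registerDoParallel(cl)
result <- run_tasks()
stopCluster(cl)
```

If this runs cleanly with the PSOCK cluster, the same loop should also run after switching back to registerDoParallel(2).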


The auto-export feature of foreach doesn't work quite as smoothly when using it inside a function because foreach is rather conservative about what to auto-export. It seems pretty safe to auto-export variables and functions that are defined in the current environment, but outside of that seems risky to me because of the complexity of R's scoping rules.

I tend to agree with your comment that your two work-arounds aren't very stable for an actively developed package, but if f and g are defined in package foo, then you should use the foreach .packages option to load the package foo on the workers:

g <- function(){
    r = foreach(x = 1:4, .packages='foo') %dopar% {
        return(x + f())
    }
    return(r)
}

Then f will be in the scope of g even though it is neither implicitly nor explicitly exported by foreach. However, this does require that f is an exported function of foo (as opposed to an internal function), since the code executed by the workers isn't defined in foo, so it can only access exported functions. (Sorry for using the term "export" in two different ways, but it's hard to avoid.)
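As a runnable stand-in for the hypothetical package foo, here is a sketch that uses the tools package instead, since file_ext is an exported function of tools. It assumes foreach and doParallel are installed; the wrapper name get_extensions is made up for this example:

```r
library(doParallel)  # also attaches foreach, iterators and parallel

get_extensions <- function(paths) {
  # .packages = "tools" makes each worker load tools, so its exported
  # function file_ext is found without any .export entry.
  foreach(p = paths, .packages = "tools") %dopar% file_ext(p)
}

cl <- makePSOCKcluster(2)
registerDoParallel(cl)
exts <- get_extensions(c("report.csv", "notes.txt"))
stopCluster(cl)
```

An internal (non-exported) function of the package would not be reachable this way, for the reason given above.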

I'm always interested to hear comments such as yours because I'm always wondering if the auto-export rules should be tweaked. In this case, I'm thinking that if a foreach loop is executed by a function that is defined in a package, the cluster workers should auto-load that package without the need for the .packages option. I'll try to look into that and perhaps add this to the next release of doParallel and doSNOW.

Author: sssheridan

Updated on September 14, 2022

Comments

  • sssheridan over 1 year

    I have a problem with foreach that I just can't figure out. The following code fails on two Windows computers I've tried, but succeeds on three Linux computers, all running the same versions of R and doParallel:

    library("doParallel")
    registerDoParallel(cl=2,cores=2)
    
    f <- function(){return(10)}
    g <- function(){
        r = foreach(x = 1:4) %dopar% {
            return(x + f())
        }
        return(r)
    }
    g()
    

    On these two Windows computers, the following error is returned:

    Error in { : task 1 failed - "could not find function "f""
    

    However, this works just fine on the Linux computers, and also works just fine with %do% instead of %dopar%, and works fine for a regular for loop.

    The same is true with variables, e.g. setting i <- 10 and replacing return(x + f()) with return(x + i)

    For others with the same problem, two workarounds are:

    1) explicitly import the needed functions and variables with .export:

    r = foreach(x=1:4, .export="f") %dopar% 
    

    2) import all global objects:

    r = foreach(x=1:4, .export=ls(.GlobalEnv)) %dopar% 
    

    The problem with these workarounds is that they aren't the most stable for a big, actively developing package. In any case, foreach is supposed to behave like for.

    Any ideas of what's causing this and if there's a fix?


    Version info of the computer that the function works on:

    R version 3.2.2 (2015-08-14)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: CentOS release 6.5 (Final)
    
    other attached packages:
    [1] doParallel_1.0.10 iterators_1.0.8   foreach_1.4.3
    

    The computer the function doesn't work on:

    R version 3.2.2 (2015-08-14)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 7 x64 (build 7601) Service Pack 1
    
    other attached packages:
    [1] doParallel_1.0.10 iterators_1.0.8   foreach_1.4.3  
    
    • tblznbits over 8 years
      Where is the f() function in your example code? Based on what you've provided, it seems as though the Windows machine is giving the right error as f is not a function, but instead a number.
  • Steve Weston over 8 years
    You should use stopCluster(clusters) rather than stopImplicitCluster() since you're explicitly calling makeCluster. Also, using .combine=list is going to give you lists inside of lists if you have more than 100 tasks.
  • tblznbits over 8 years
    @SteveWeston Literally 6 days after you tell me this, I'm running a code with doParallel and trying to plyr::rbind.fill(data_list) at the end and it's failing. I can't figure out why for about an hour and then I finally realize it's because of this comment. So thanks again for the heads up.
  • Brandon Bertelsen over 6 years
    You can also export the variables after you make your cluster using clusterExport(cl, c("list","of","variables")). If the call is inside a function, remember to set envir = environment(), because clusterExport looks in the global environment (.GlobalEnv) by default.
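A minimal sketch of the clusterExport approach from the last comment, using only the base parallel package. The names multiply, scale_values and scale_factor are hypothetical and chosen for illustration:

```r
library(parallel)

# Hypothetical worker function. It looks up scale_factor in the global
# environment of whichever R process runs it.
multiply <- function(v) v * scale_factor

scale_values <- function(values) {
  scale_factor <- 3  # local to this function, not present in .GlobalEnv

  cl <- makeCluster(2)
  on.exit(stopCluster(cl))

  # envir = environment() exports from this function's frame; the default
  # (.GlobalEnv) would not contain scale_factor.
  clusterExport(cl, "scale_factor", envir = environment())

  parLapply(cl, values, multiply)
}

scaled <- scale_values(1:3)
```

Variables exported this way land in each worker's global environment, so they should also be visible to a foreach loop run on the same cluster after registerDoParallel(cl).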