Why does adding concurrency slow down this golang code?


Solution 1

The issue seems to come from your use of rand.Float64(), which uses a shared global object with a Mutex lock on it.

Instead, if for each CPU you create a separate rand.New(), pass it through to the interactions(), and use it to create the Float64(), there's a massive improvement.


Update, showing the changes for the new example code in the question that now uses rand.New():

The test() function was modified to either send its result on a given channel, or return it directly.

func test(n int, c chan []int) []int {
    // Each call creates its own generator, so goroutines never contend
    // on the global rand mutex.
    source := rand.NewSource(time.Now().UnixNano())
    generator := rand.New(source)
    simulations := make([]int, n)
    for i := range simulations {
        for _, v := range simulation(NUMBER_OF_INTERACTIONS, generator) {
            simulations[i] += v
        }
    }
    if c == nil {
        return simulations // serial mode: return directly
    }
    c <- simulations // concurrent mode: send on the channel
    return nil
}

The main() function was updated to run both tests, and output the timed result.

func main() {
    rand.Seed(time.Now().UnixNano())

    nCPU := runtime.NumCPU()
    runtime.GOMAXPROCS(nCPU)
    fmt.Println("Number of CPUs: ", nCPU)

    start := time.Now()
    fmt.Println("Successful interactions: ", len(test(NUMBER_OF_SIMULATIONS, nil)))
    fmt.Println(time.Since(start))

    start = time.Now()
    tests := make([]chan []int, nCPU)
    for i := range tests {
        c := make(chan []int)
        go test(NUMBER_OF_SIMULATIONS/nCPU, c)
        tests[i] = c
    }

    // Concatenate the test results
    results := make([]int, NUMBER_OF_SIMULATIONS)
    for i, c := range tests {
        lo := (NUMBER_OF_SIMULATIONS / nCPU) * i
        hi := (NUMBER_OF_SIMULATIONS / nCPU) * (i + 1)
        copy(results[lo:hi], <-c) // lo/hi rather than start/stop, to avoid shadowing the timer
    }
    fmt.Println("Successful interactions: ", len(results))
    fmt.Println(time.Since(start))
}

The output I received:

> Number of CPUs:  2 
>
> Successful interactions:  1000 
> 1m20.39959s
>
> Successful interactions:  1000
> 41.392299s

Solution 2

Testing your code on my Linux quad-core i7 laptop, I get the results below.

Here is a Google Spreadsheet of the timings:

[Screenshot of Google Spreadsheet]

This shows that, under Linux at least, the scaling is very nearly linear per core.

I think there may be two reasons why you aren't seeing this.

The first is that your MacBook Air only has 2 real cores. It has 4 hyperthreads, though, which is why it reports 4 as the CPU count. A hyperthread typically gives only an extra 15% performance over a single core, rather than the 100% you might expect. So stick to benchmarking 1 or 2 CPUs only on the MacBook Air!

The other reason might be OS X thread performance compared to Linux: they use different threading models, which may be affecting performance.

Solution 3

Your code is sampling a binomial random variable, B(N, p) where N is the number of trials (here 1M), and p is the probability of a successful individual trial (here 0.0003).

One way to do this is to build a table T of cumulative probabilities, where T[i] contains the probability that the total number of successes is less than or equal to i. To then produce a sample, you can pick a uniform random variable (via rand.Float64) and find the first index in the table that contains a probability greater than or equal to it.

It's a little more complicated here because you've got a really big N and a fairly small p, so building the table directly runs into trouble with really small numbers and floating-point accuracy. But you can build a smaller table (say, of size 1000) and sample it 1000 times to get your 1 million trials.

Here's some code that does all this. It's not too elegant (1000 is hard-coded in), but it generates 1000 simulations in less than a second on my old laptop. It's easy to optimise further, by for example lifting the construction of the BinomialSampler out of the loop, or by using binary search rather than a linear scan to find the table index.

package main

import (
    "fmt"
    "math"
    "math/rand"
)

type BinomialSampler []float64 // bs[i] holds P(successes <= i)

func (bs BinomialSampler) Sample() int {
    // Inverse-CDF sampling: draw a uniform r and return the first
    // index whose cumulative probability is >= r.
    r := rand.Float64()
    for i := 0; i < len(bs); i++ {
        if bs[i] >= r {
            return i
        }
    }
    return len(bs)
}

func NewBinomialSampler(N int, p float64) BinomialSampler {
    r := BinomialSampler(make([]float64, N+1))
    T := 0.0
    choice := 1.0 // running binomial coefficient C(N, i)
    for i := 0; i <= N; i++ {
        // Add C(N, i) * p^i * (1-p)^(N-i), the probability of exactly
        // i successes, to the running cumulative total.
        T += choice * math.Pow(p, float64(i)) * math.Pow(1-p, float64(N-i))
        r[i] = T
        choice *= float64(N-i) / float64(i+1)
    }
    return r
}

func WowSample(N int, p float64) int {
    if N%1000 != 0 {
        panic("N must be a multiple of 1000")
    }
    bs := NewBinomialSampler(1000, p)
    r := 0
    for i := 0; i < N; i += 1000 {
        r += bs.Sample()
    }
    return r
}

func main() {
    for i := 0; i < 1000; i++ {
        fmt.Println(WowSample(1000000, 0.0003))
    }
}

Solution 4

My results, which show a substantial speedup for 4 CPUs versus 1 CPU:

Intel Core 2 Quad CPU Q8300 @ 2.50GHz x 4

Source code: the UPDATE (01/12/13 18:05) version from the question.

$ go version
go version devel +adf4e96e9aa4 Thu Jan 10 09:57:01 2013 +1100 linux/amd64

$ time  go run temp.go
Number of CPUs:  1
real    0m30.305s
user    0m30.210s
sys     0m0.044s

$ time  go run temp.go
Number of CPUs:  4
real    0m9.980s
user    0m35.146s
sys     0m0.204s

Author: Christopher Roach

Updated on June 06, 2022

Comments

  • Christopher Roach
    Christopher Roach almost 2 years

    I've got a bit of Go code that I've been tinkering with to answer a little curiosity of mine related to a video game my brother-in-law plays.

    Essentially, the code below simulates interactions with monsters in the game and how often he can expect them to drop items upon their defeat. The problem I'm having is that I would expect a piece of code like this to be perfect for parallelization, but when I add in concurrency, the total time taken to run all of the simulations increases by a factor of 4-6 over the original without concurrency.

    To give you a better understanding of how the code works, I have three main functions: the interaction function, which simulates a single interaction between the player and a monster and returns 1 if the monster drops an item and 0 otherwise; the simulation function, which runs several interactions and returns a slice of interaction results (i.e., 1's and 0's representing successful/unsuccessful interactions); and finally the test function, which runs a set of simulations and returns a slice of simulation results, each being the total number of interactions that resulted in a dropped item. It's the last function that I am trying to run in parallel.

    Now, I could understand why the code would slow down if I created a goroutine for each test that I want to run. Assuming I'm running 100 tests, the context switching between each of the goroutines across the 4 CPUs my MacBook Air has would kill the performance, but I'm only creating as many goroutines as I have processors and dividing the number of tests between the goroutines. I would expect this to actually speed up the code's performance since I am running each of my tests in parallel, but, of course, I'm getting a major slow down instead.

    I'd love to figure out why this is happening, so any help would be greatly appreciated.

    Below is the regular code without the go routines:

    package main
    
    import (
        "fmt"
        "math/rand"
        "time"
    )
    
    const (
        NUMBER_OF_SIMULATIONS = 1000
        NUMBER_OF_INTERACTIONS = 1000000
        DROP_RATE = 0.0003
    )
    
    /**
     * Simulates a single interaction with a monster
     *
     * Returns 1 if the monster dropped an item and 0 otherwise
     */
    func interaction() int {
        if rand.Float64() <= DROP_RATE {
            return 1
        }
        return 0
    }
    
    /**
     * Runs several interactions and returns a slice representing the results
     */
    func simulation(n int) []int {
        interactions := make([]int, n)
        for i := range interactions {
            interactions[i] = interaction()
        }
        return interactions
    }
    
    /**
     * Runs several simulations and returns the results
     */
    func test(n int) []int {
        simulations := make([]int, n)
        for i := range simulations {
            successes := 0
            for _, v := range simulation(NUMBER_OF_INTERACTIONS) {
                successes += v
            }
            simulations[i] = successes
        }
        return simulations
    }
    
    func main() {
        rand.Seed(time.Now().UnixNano())
        fmt.Println("Successful interactions: ", test(NUMBER_OF_SIMULATIONS))
    }
    

    And, here is the concurrent code with the goroutines:

    package main
    
    import (
        "fmt"
        "math/rand"
        "time"
        "runtime"
    )
    
    const (
        NUMBER_OF_SIMULATIONS = 1000
        NUMBER_OF_INTERACTIONS = 1000000
        DROP_RATE = 0.0003
    )
    
    /**
     * Simulates a single interaction with a monster
     *
     * Returns 1 if the monster dropped an item and 0 otherwise
     */
    func interaction() int {
        if rand.Float64() <= DROP_RATE {
            return 1
        }
        return 0
    }
    
    /**
     * Runs several interactions and returns a slice representing the results
     */
    func simulation(n int) []int {
        interactions := make([]int, n)
        for i := range interactions {
            interactions[i] = interaction()
        }
        return interactions
    }
    
    /**
     * Runs several simulations and returns the results
     */
    func test(n int, c chan []int) {
        simulations := make([]int, n)
        for i := range simulations {
            for _, v := range simulation(NUMBER_OF_INTERACTIONS) {
                simulations[i] += v
            }
        }
        c <- simulations
    }
    
    func main() {
        rand.Seed(time.Now().UnixNano())
    
        nCPU := runtime.NumCPU()
        runtime.GOMAXPROCS(nCPU)
        fmt.Println("Number of CPUs: ", nCPU)
    
        tests := make([]chan []int, nCPU)
        for i := range tests {
            c := make(chan []int)
            go test(NUMBER_OF_SIMULATIONS/nCPU, c)
            tests[i] = c
        }
    
    // Concatenate the test results
        results := make([]int, NUMBER_OF_SIMULATIONS)
        for i, c := range tests {
            start := (NUMBER_OF_SIMULATIONS/nCPU) * i
            stop := (NUMBER_OF_SIMULATIONS/nCPU) * (i+1)
            copy(results[start:stop], <-c)
        }
    
        fmt.Println("Successful interactions: ", results)
    }
    

    UPDATE (01/12/13 18:05)

    I've added a new version of the concurrent code below that creates a new Rand instance for each goroutine, per "the system"'s suggestion below. I'm now seeing a very slight speedup compared to the serial version of the code (around a 15-20% reduction in overall time taken). I'd love to know why I don't see something closer to a 75% reduction in time, since I'm spreading the workload over my MacBook Air's 4 cores. Does anyone have any further suggestions that could help out?

    package main
    
    import (
        "fmt"
        "math/rand"
        "time"
        "runtime"
    )
    
    const (
        NUMBER_OF_SIMULATIONS = 1000
        NUMBER_OF_INTERACTIONS = 1000000
        DROP_RATE = 0.0003
    )
    
    /**
     * Simulates a single interaction with a monster
     *
     * Returns 1 if the monster dropped an item and 0 otherwise
     */
    func interaction(generator *rand.Rand) int {
        if generator.Float64() <= DROP_RATE {
            return 1
        }
        return 0
    }
    
    /**
     * Runs several interactions and returns a slice representing the results
     */
    func simulation(n int, generator *rand.Rand) []int {
        interactions := make([]int, n)
        for i := range interactions {
            interactions[i] = interaction(generator)
        }
        return interactions
    }
    
    /**
     * Runs several simulations and returns the results
     */
    func test(n int, c chan []int) {
        source := rand.NewSource(time.Now().UnixNano())
        generator := rand.New(source)
        simulations := make([]int, n)
        for i := range simulations {
            for _, v := range simulation(NUMBER_OF_INTERACTIONS, generator) {
                simulations[i] += v
            }
        }
        c <- simulations
    }
    
    func main() {
        rand.Seed(time.Now().UnixNano())
    
        nCPU := runtime.NumCPU()
        runtime.GOMAXPROCS(nCPU)
        fmt.Println("Number of CPUs: ", nCPU)
    
        tests := make([]chan []int, nCPU)
        for i := range tests {
            c := make(chan []int)
            go test(NUMBER_OF_SIMULATIONS/nCPU, c)
            tests[i] = c
        }
    
    // Concatenate the test results
        results := make([]int, NUMBER_OF_SIMULATIONS)
        for i, c := range tests {
            start := (NUMBER_OF_SIMULATIONS/nCPU) * i
            stop := (NUMBER_OF_SIMULATIONS/nCPU) * (i+1)
            copy(results[start:stop], <-c)
        }
    
        fmt.Println("Successful interactions: ", results)
    }
    

    UPDATE (01/13/13 17:58)

    Thanks everyone for the help in figuring out my problem. I did finally get the answer I was looking for and so I thought I would just summarize here for anyone who has the same problem.

    Essentially I had two main issues: first, even though my code was embarrassingly parallel, it ran slower when I split it up amongst the available processors; and second, the solution to that opened up another issue, which was that my serial code ran twice as slow as the concurrent code on a single processor, when you would expect the two to be roughly the same. In both cases the culprit was the random number generator function rand.Float64. Basically, this is a convenience function provided by the rand package. In that package, a global instance of the Rand struct is created and used by each of the convenience functions. This global Rand instance has a mutex lock associated with it. Since I was using this convenience function, I wasn't truly able to parallelize my code, since each of the goroutines had to line up for access to the global Rand instance. The solution (as "the system" suggests below) is to create a separate instance of the Rand struct for each goroutine. This solved the first problem but created the second one.

    The second problem was that my non-parallel concurrent code (i.e., my concurrent code running with only a single processor) was running twice as fast as the sequential code. The reason for this was that, even though I was only running with a single processor and a single goroutine, that goroutine had its own instance of the Rand struct that I had created, and I had created it without the mutex lock. The sequential code was still using the rand.Float64 convenience function which made use of the global mutex protected Rand instance. The cost of acquiring that lock was causing the sequential code to run twice as slow.

    So, the moral of the story is: whenever performance matters, create your own instance of the Rand struct and call the methods you need on it, rather than using the convenience functions provided by the package.

    • Paul Hankin
      Paul Hankin over 11 years
      Using a different algorithm can produce 1000 simulations of 1000000 interactions in less than a second (details in my answer below). While it doesn't answer your question about concurrency, it does solve your problem massively more efficiently.
  • Christopher Roach
    Christopher Roach over 11 years
    Thanks for the tip, I updated the code to create an instance of Rand for each goroutine and pass it through to the interaction function and it does seem to have sped up the concurrent code. I'm still not getting major speed ups though. I kind of expected to see something close to a 4x reduction in time (since I have 4 cores on my machine) but instead, I'm only seeing about a 1.2x reduction in time.
  • Christopher Roach
    Christopher Roach over 11 years
    I went ahead and added the new code with your suggested changes to the question above. Feel free to take and look and let me know if I did anything wrong.
  • Christopher Roach
    Christopher Roach over 11 years
    Also, I've been playing around with the code a bit more and it seems that the best speed up I see is when I set the number of CPUs to 1 instead of using the runtime.NumCPU function to determine the correct amount. When I do this, I see the time taken reduced to roughly 1/2 of the serial code's time. This is closer to what I had hoped to see with the work distributed over the 4 cores, but it's weird that I would see this reduction in time when lowering the number of available CPUs. Any ideas why this would be the case?
  • the system
    the system over 11 years
    @ChristopherRoach: When I was testing, I had a 40-50% reduction on my dual-core laptop, but then I had also changed a few things with respect to how the channels were used. Your use of rand is nearly identical to what I had. I'm going to mess around with your updated example, and see how it runs on my machine.
  • the system
    the system over 11 years
    @ChristopherRoach: Yeah, I'm not sure why you're not getting a better result. I updated my answer to show the changes I made, which don't really change your code at all. I just made it easy to run and time both versions. The version that uses goroutines is about twice as fast for me.
  • Christopher Roach
    Christopher Roach over 11 years
    Well, the changes you suggested did work; the problem was with how I was timing the code. I was using the time program on my Mac for timing, which was giving me some bad results (probably my own fault), so when I switched to using the time.Now() function for timing instead, I started seeing the (approximately) 4x decrease in execution time that I was expecting.
  • Christopher Roach
    Christopher Roach over 11 years
    I'm still seeing one more oddity that I'd love your feedback on; basically, when I run the original (serial) code, with 1K simulations and 1M interactions, I get about a 45s execution time. When I run the concurrent code with the MAXPROCS set to 1, I would expect to see roughly the same execution time as the serial code, but instead I'm seeing a decrease of 50% (~22s total execution time). The only difference that I can see between the two is the use of a goroutine to run the test function, but since I'm only using a single CPU shouldn't the execution time be roughly the same?
  • Christopher Roach
    Christopher Roach over 11 years
    Thanks Nick, I'm actually seeing similar performance to what you list above. It looks like I wasn't timing the code properly when I reported my findings before. That said, I would like some suggestions on why I'm seeing such a drastic difference between my purely serial code and my concurrent code with a single processor (see my last comment in the answer above). So, any suggestions you might have would be greatly appreciated. Cheers.
  • Christopher Roach
    Christopher Roach over 11 years
    Thanks PeterSO, I switched to Ubuntu to run the code and started seeing the same thing, so it looks like I was timing the code incorrectly on OS X. Everything seems to be working the way I would have expected it to now.
  • the system
    the system over 11 years
    @ChristopherRoach: Using the code above, I get very similar execution times when GOMAXPROCS is 1. Could simply be something in your system's architecture (like the hyperthreading mentioned below) that goroutines can take advantage of even with one processor set. You could disable the hyperthreading on your machine to see if it makes a difference. Mine is an AMD Athlon II, which doesn't have that feature.
  • Christopher Roach
    Christopher Roach over 11 years
    Good point about the hyperthreading. I turned it off and even tried turning off all but one of my processors, and, while it did slow both programs down slightly, I still see the same roughly 50% difference between the serial code and concurrent code with MAXPROCESSORS set to 1. I'd love to know why, so I think I may open up another separate question to explore that a bit further. I did notice that at all times the number of threads that both programs use is never less than 2, so maybe there's a reason tied to that...
  • Christopher Roach
    Christopher Roach over 11 years
    One more question for you. Now that I think about it, the shared state of the random number generator seems obvious given that it's only pseudo-random and therefore completely deterministic, but how did you know about the mutex lock on the shared state? Was that just a reasonable assumption given that you know the algorithm must be deterministic, or did you find documentation on that somewhere? And, if so, could you point me to those docs? Thanks.
  • the system
    the system over 11 years
    @ChristopherRoach: With respect to the threading, probably a good idea to open another question. It's very interesting that you're experiencing the difference. With respect to the mutex, I looked up the docs for the rand.Float64(), and clicked on the method name to go to the source, and then followed it to the globalRand, and finally the lockedSource, which has the Mutex.
  • Nick Craig-Wood
    Nick Craig-Wood over 11 years
    The difference is entirely down to the random number generator. If you put var source = rand.NewSource(time.Now().UnixNano()) and var generator = rand.New(source) at the top of the original source code, and replace the call with generator.Float64() you'll see the original code will take exactly the same time as the concurrent code with maxCpus = 1. I don't know why there is a difference between them though!
  • Christopher Roach
    Christopher Roach over 11 years
    I tried what you suggested and I'm now seeing equal times between the serial and concurrent (MAXPROCESSORS=1) code. I took a look at the Go source and noticed that the globalRand object used by the rand.Float64 function is using a locked source (as "the system" suggested in his answer above). I copied that code into my serial code example and tried it with and without the calls to acquire the lock on the source object, and that made all the difference. Looks like the cost of acquiring the lock is what's adding the additional time to my sequential code example. Mystery solved! Cheers!
  • Christopher Roach
    Christopher Roach over 11 years
    You and Nick (see below) both helped me figure out the difference in time between the serial and concurrent (MAXPROCESSORS=1) code examples. Basically, in my concurrent code, I was creating a new instance of the Rand struct (as you suggested in your answer) that didn't have the mutex lock on the source object so it didn't have the extra overhead of acquiring that lock. The serial code did have that extra overhead which was making it run twice as slow as the concurrent code. Thanks so much for your help, this has been an extremely interesting and enlightening conversation!
  • the system
    the system over 11 years
    @ChristopherRoach: Ah, that makes sense. I was using the same code that used the new Rand instance for testing both versions. Glad you got it figured out.