C++ OpenMP Parallel For Loop - Alternatives to std::vector


Solution 1

The question you link discusses the fact that "the STL vector container is not thread-safe in the situation where multiple threads write to a single container". As correctly stated there, this only applies if you call methods that can change the size of the vector and thereby cause reallocation of the underlying array that std::vector holds. push_back(), pop_back() and insert() are examples of these dangerous methods.
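
To make the hazard concrete, here is a minimal sketch (my own illustration, not code from the linked question) of the unsafe pattern:

std::vector<int> vec;
#pragma omp parallel for
for(int i=0; i<100; i++) {
    vec.push_back(i); // DATA RACE: a concurrent push_back may reallocate while another thread is writing
}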

If you need thread-safe reallocation, then the Intel Threading Building Blocks (TBB) library offers you concurrent vector containers. You should not use tbb::concurrent_vector in single-threaded programs, because the time it takes to access random elements is higher than the time std::vector takes to do the same (which is O(1)). However, the concurrent vector calls push_back(), pop_back() and insert() in a thread-safe way, even when reallocation happens.
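
A minimal sketch of how it drops into an OpenMP loop (my own illustration; assumes TBB is installed and linked):

#include <tbb/concurrent_vector.h>

tbb::concurrent_vector<int> cv;
#pragma omp parallel for
for(int i=0; i<100; i++) {
    cv.push_back(i); // safe even when the underlying storage grows; final element order is unspecified
}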

EDIT 1: Slides 46 and 47 of the following Intel presentation give an illustrative example of concurrent reallocation using tbb::concurrent_vector.

EDIT 2: By the way, if you start using Intel Threading Building Blocks (it is open source, works with most compilers, and is much better integrated with C++/C++11 features than OpenMP), then you don't need OpenMP to create a parallel for loop. Here is a nice example of parallel_for using TBB.
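
As a rough sketch of that style (my own illustration, not the linked example):

#include <tbb/parallel_for.h>
#include <tbb/concurrent_vector.h>

tbb::concurrent_vector<int> cv;
// the simple integer-range overload of tbb::parallel_for; requires C++11 lambdas
tbb::parallel_for(0, 100, [&cv](int i) {
    cv.push_back(i);
});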

Solution 2

I think you can use std::vector with OpenMP most of the time and still have good performance. The following code, for example, fills std::vectors in parallel and then combines them at the end. As long as your main loop/fill function is the bottleneck, this should work well in general and be thread-safe.

std::vector<int> vec;
#pragma omp parallel
{
    std::vector<int> vec_private;
    #pragma omp for nowait //fill vec_private in parallel
    for(int i=0; i<100; i++) {
        vec_private.push_back(i);
    }
    #pragma omp critical // merge the per-thread results one thread at a time
    vec.insert(vec.end(), vec_private.begin(), vec_private.end());
}

Edit:

OpenMP 4.0 allows user-defined reductions using #pragma omp declare reduction. With the merge reduction below, each thread fills a default-constructed private copy of vec and the combiner concatenates them. The code above can be simplified to

#pragma omp declare reduction (merge : std::vector<int> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

std::vector<int> vec;
#pragma omp parallel for reduction(merge: vec)
for(int i=0; i<100; i++) vec.push_back(i);

Edit: What I have shown so far does not fill the vector in order. If the order matters, then it can be done like this (schedule(static) assigns thread i the i-th contiguous chunk of iterations, so appending in thread order reproduces the original order)

std::vector<int> vec;
#pragma omp parallel
{
    std::vector<int> vec_private;
    #pragma omp for nowait schedule(static)
    for(int i=0; i<N; i++) { 
        vec_private.push_back(i);
    }
    // append the thread-private vectors in thread order
    #pragma omp for schedule(static) ordered
    for(int i=0; i<omp_get_num_threads(); i++) {
        #pragma omp ordered
        vec.insert(vec.end(), vec_private.begin(), vec_private.end());
    }
}

This avoids saving a std::vector for each thread and then merging them in serial outside of the parallel region. I learned about this "trick" here. It's not possible to do this with user-defined reductions, because the number of times the combiner is executed, and the order of those executions, is unspecified.

I just realized that the critical section is not necessary, which I figured out from this question: parallel-cumulative-prefix-sums-in-openmp-communicating-values-between-thread. This method gets the order correct as well: because schedule(static) gives each thread a contiguous chunk, an exclusive prefix sum of the per-thread counts yields a disjoint destination range for each thread.

std::vector<int> vec;
size_t *prefix;
#pragma omp parallel
{
    int ithread  = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    #pragma omp single
    {
        prefix = new size_t[nthreads+1]; // one entry per thread plus a slot for the total
        prefix[0] = 0;
    }
    std::vector<int> vec_private;
    #pragma omp for schedule(static) nowait
    for(int i=0; i<100; i++) {
        vec_private.push_back(i);
    }
    prefix[ithread+1] = vec_private.size(); // record this thread's element count
    #pragma omp barrier
    #pragma omp single
    {
        // exclusive prefix sum: prefix[i] becomes the start offset for thread i
        for(int i=1; i<(nthreads+1); i++) prefix[i] += prefix[i-1];
        vec.resize(vec.size() + prefix[nthreads]);
    }
    // each thread copies into its own disjoint range; no critical section needed
    std::copy(vec_private.begin(), vec_private.end(), vec.begin() + prefix[ithread]);
}
delete[] prefix;


Author: Armin Meisterhirn

Updated on April 13, 2020

Comments

  • Armin Meisterhirn
    Armin Meisterhirn about 4 years

    Based on this thread, OpenMP and STL vector, which data structures are good alternatives for a shared std::vector in a parallel for loop? The main aspect is speed, and the vector might require resizing during the loop.

    • LihO
      LihO almost 11 years
      Show us some code, describe your specific situation... what will be stored in the vector? What will your loop do with it? It's very probable that it will be perfectly safe to use std::vector anyway.
  • Hristo Iliev
    Hristo Iliev over 9 years
    As for the question in the very last sentence: "The number of times the combiner is executed, and the order of these executions, for any reduction clause is unspecified," therefore not possible.
  • Admin
    Admin almost 6 years
    Thanks, this really helped me!
  • Joachim
    Joachim over 4 years
    Just a question: if we had a nested for loop, would the code above still hold by writing: std::vector<int> vec; #pragma omp parallel { #pragma omp for collapse(2) nowait schedule(static) for(int i=0; i<N; i++) { for(int j=0; j< M; j++){ } } #pragma omp for collapse(2) schedule(static) ordered for(int i=0; i<omp_get_num_threads(); i++) { #pragma omp ordered do some stuff } }
  • Joachim
    Joachim over 4 years
    Sorry for the ugly indentation
  • Z boson
    Z boson over 4 years
    @Joachim, I don't know. I don't normally use nested parallelism. I don't have time to look into this either. Maybe ask a question. There are many experts here that can help you.
  • Joachim
    Joachim over 4 years
    Thanks, I think I will