How do I parallelize a for loop through a C++ std::list using OpenMP?

14,186

Solution 1

If you decide to use Openmp 3.0, you can use the task feature:

#pragma omp parallel
#pragma omp single
{
  for(auto it = l.begin(); it != l.end(); ++it)
     #pragma omp task firstprivate(it)
       it->process();
  #pragma omp taskwait
}

This will execute the loop in one thread, but delegate the processing of elements to others.

Without OpenMP 3.0 the easiest way would be writing all pointers to elements in the list (or iterators in a vector and iterating over that one. This way you wouldn't have to copy anything back and avoid the overhead of copying the elements themselves, so it shouldn't have to much overhead:

std::vector<my_element*> elements; //my_element is whatever is in list
for(auto it = list.begin(); it != list.end(); ++it)
  elements.push_back(&(*it));

#pragma omp parallel shared(chunks)
{
  #pragma omp for
  for(size_t i = 0; i < elements.size(); ++i) // or use iterators in newer OpenMP
      elements[i]->process();
}

If you want to avoid copying even the pointers, you can always create a parallelized for loop by hand. You can either have the threads access interleaved elements of the list (as proposed by KennyTM) or split the range in roughly equal contious parts before iterating and iterating over those. The later seems preferable since the threads avoid accessing listnodes currently processed by other threads (even if only the next pointer), which could lead to false sharing. This would look roughly like this:

#pragma omp parallel
{
  int thread_count = omp_get_num_threads();
  int thread_num   = omp_get_thread_num();
  size_t chunk_size= list.size() / thread_count;
  auto begin = list.begin();
  std::advance(begin, thread_num * chunk_size);
  auto end = begin;
  if(thread_num = thread_count - 1) // last thread iterates the remaining sequence
     end = list.end();
  else
     std::advance(end, chunk_size);
  #pragma omp barrier
  for(auto it = begin; it != end; ++it)
    it->process();
}

The barrier is not strictly needed, however if process mutates the processed element (meaning it is not a const method), there might be some sort of false sharing without it, if threads iterate over a sequence which is already being mutated. This way will iterate 3*n times over the sequence (where n is the number of threads), so scaling might be less then optimal for a high number of threads.

To reduce the overhead you could put the generation of the ranges outside of the #pragma omp parallel, however you will need to know how many threads will form the parallel section. So you'd probably have to manually set the num_threads, or use omp_get_max_threads() and handle the case that the number of threads created is less then omp_get_max_threads() (which is only an upper bound). The last way could be handled by possibly assigning each thread severa chunks in that case (using #pragma omp for should do that):

int max_threads = omp_get_max_threads();
std::vector<std::pair<std::list<...>::iterator, std::list<...>::iterator> > chunks;
chunks.reserve(max_threads); 
size_t chunk_size= list.size() / max_threads;
auto cur_iter = list.begin();
for(int i = 0; i < max_threads - 1; ++i)
{
   auto last_iter = cur_iter;
   std::advance(cur_iter, chunk_size);
   chunks.push_back(std::make_pair(last_iter, cur_iter);
}
chunks.push_back(cur_iter, list.end();

#pragma omp parallel shared(chunks)
{
  #pragma omp for
  for(int i = 0; i < max_threads; ++i)
    for(auto it = chunks[i].first; it != chunks[i].second; ++it)
      it->process();
}

This will take only three iterations over list (two, if you can get the size of the list without iterating). I think that is about the best you can do for non random access iterators without using tasks or iterating over some out of place datastructure (like a vector of pointer).

Solution 2

I doubt it is possible since you can't just jump into the middle of a list without traversing the list. Lists are not stored in contiguous memory and std::list iterators are not Random Access. They are only bidirectional.

Solution 3

http://openmp.org/forum/viewtopic.php?f=3&t=51

#pragma omp parallel
{
   for(it= list1.begin(); it!= list1.end(); it++)
   {
      #pragma omp single nowait
      {
         it->compute();
      }
   } // end for
} // end ompparallel

This can be understood as unrolled as:

{
  it = listl.begin
  #pragma omp single nowait
  {
    it->compute();
  }
  it++;
  #pragma omp single nowait
  {
    it->compute();
  }
  it++;
...
}

Given a code like this:

int main()                                                                      
{                                                                               
        std::vector<int> l(4,0);                                                
        #pragma omp parallel for                                                        
        for(int i=0; i<l.size(); ++i){                                          
                printf("th %d = %d \n",omp_get_thread_num(),l[i]=i);            
        }                                                                       
        printf("\n");                                                           
       #pragma omp parallel                                                            
        {                                                                       
                for (auto i = l.begin(); i != l.end(); ++i) {                   
               #pragma omp single nowait                                                       
                {                                                       
                        printf("th %d = %d \n",omp_get_thread_num(),*i);
                }                                                       
            }                                                               
        }                                                                       
        return 0;                                                               
} 

export OMP_NUM_THREADS=4, output as follows (note the second section, work thread number can repeat):

th 2 = 2 
th 1 = 1 
th 0 = 0 
th 3 = 3 

th 2 = 0 
th 1 = 1 
th 2 = 2 
th 3 = 3
Share:
14,186
mshang
Author by

mshang

Updated on June 06, 2022

Comments

  • mshang
    mshang about 2 years

    I would like to iterate through all elements in an std::list in parallel fashion using OpenMP. The loop should be able to alter the elements of the list. Is there a simple solution for this? It seems that OpenMP 3.0 supports parallel for loops when the iterator is a Random Access Iterator, but not otherwise. In any case, I would prefer to use OpenMP 2.0 as I don't have full control over which compilers are available to me.

    If my container were a vector, I might use:

    #pragma omp parallel for
    for (auto it = v.begin(); it != v.end(); ++it) {
        it->process();
    }
    

    I understand that I could copy the list into a vector, do the loop, then copy everything back. However, I would like to avoid this complexity and overhead if possible.

  • Ben Voigt
    Ben Voigt over 12 years
    Well, if the per-element processing is much more expensive than iteration, then parallelization might still be desirable.
  • kennytm
    kennytm over 12 years
    You could iterate with it1 = v.begin() and it2 = it1 + 1, and then it1 += 2 and it2 += 2, if there are 2 threads of execution.
  • mshang
    mshang over 12 years
    @KennyTM, thanks. This is along the lines of what I'm looking for.
  • jmucchiello
    jmucchiello over 12 years
    But += 2 on a non-random access iterator still requires traversing the iterator one iteration at a time. I suppose this would work if the list is const during the loop. But I'm guessing you would need to do a lot of setup to get this to work. You can't say it1 += 2 because you still need to test it1 != list.end() for both increments.
  • mshang
    mshang over 12 years
    Thanks for the detailed answer.
  • user
    user about 11 years
    I want iterate over whole map. How can I iterate using OpenMp the whole map?
  • Pierluigi
    Pierluigi almost 7 years
    It should be: #pragma omp parallel private(it) so that each thread gets a copy of the iterator