About unique_ptr performance

c++ gcc c++11 unique-ptr

Solution 1

All you did in the timed blocks is dereference the pointers. That won't involve any additional overhead at all. The increased time probably comes from the console output scrolling. You should never, ever do I/O in a timed benchmark.

And if you want to test the overhead of ref counting, then actually do some ref counting. How is the increased time for construction, destruction, assignment and other mutating operations of shared_ptr going to factor into your time at all if you never mutate a shared_ptr?
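To make that concrete, here is a minimal sketch of a timed loop that actually mutates the pointers. The helper names `time_shared_copies` and `time_unique_moves` and the `sink` parameter are made up for illustration and do not come from the question; the absolute numbers will depend on compiler, flags and machine.

```cpp
#include <cassert>
#include <chrono>
#include <memory>
#include <utility>

// Hypothetical helper: copies `sp` `rounds` times and returns the elapsed
// time in nanoseconds. Every copy bumps the reference count and every
// destruction of `copy` drops it again.
inline long long time_shared_copies(const std::shared_ptr<int>& sp, int rounds,
                                    long& sink) {
  assert(sp);
  const auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < rounds; ++r) {
    std::shared_ptr<int> copy = sp;  // reference count goes up
    sink += copy.use_count();        // side effect so the loop is not dropped
  }                                  // reference count goes down
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
      .count();
}

// Hypothetical helper: moves `up` out and back `rounds` times. Ownership is
// transferred by copying a raw pointer; no counter is involved.
inline long long time_unique_moves(std::unique_ptr<int>& up, int rounds,
                                   long& sink) {
  assert(up);
  const auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < rounds; ++r) {
    std::unique_ptr<int> moved = std::move(up);
    sink += *moved;  // side effect so the loop is not dropped
    up = std::move(moved);
  }
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
      .count();
}
```

Calling both helpers with the same `rounds` and comparing the returned durations measures the operations where shared_ptr is expected to cost more, instead of plain dereferencing.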

Edit: If there's no I/O then where are the compiler optimizations? They should have nuked the whole thing. Even ideone junked the lot.

Solution 2

UPDATED on Jan 01, 2014

I know this question is pretty old, but the results are still valid on G++ 4.7.0 and libstdc++ 4.7. So, I tried to find out the reason.

What you're benchmarking here is the dereferencing performance using -O0 and, looking at the implementation of unique_ptr and shared_ptr, your results are actually correct.

unique_ptr stores the pointer and the deleter in a ::std::tuple, while shared_ptr stores a naked pointer handle directly. So, when you dereference the pointer (using *, ->, or get) you have an extra call to ::std::get<0>() in unique_ptr. In contrast, shared_ptr directly returns the pointer. ~~On gcc-4.7 even when optimized and inlined, ::std::get<0>() is a bit slower than the direct pointer.~~ When optimized and inlined, gcc-4.8.1 fully omits the overhead of ::std::get<0>(). On my machine, when compiled with -O3, the compiler generates exactly the same assembly code, which means they are literally the same.
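The two storage layouts described above can be sketched roughly as follows. `TupleHolder`, `DirectHolder` and `NoopDeleter` are hypothetical, heavily simplified stand-ins written for this illustration; they are not the real libstdc++ classes and do not manage ownership.

```cpp
#include <cassert>
#include <tuple>

// Hypothetical deleter that does nothing; it only fills the tuple slot.
struct NoopDeleter {
  void operator()(int*) const {}
};

// Tuple-based layout, in the spirit of the unique_ptr description above:
// pointer and deleter live together in one std::tuple.
class TupleHolder {
 public:
  explicit TupleHolder(int* p) : storage_(p, NoopDeleter{}) {}
  // Dereferencing goes through std::get<0>, one extra call that only
  // disappears once the optimizer inlines it.
  int& operator*() const { return *std::get<0>(storage_); }

 private:
  std::tuple<int*, NoopDeleter> storage_;
};

// Direct layout, in the spirit of the shared_ptr description above:
// the raw pointer is a plain member.
class DirectHolder {
 public:
  explicit DirectHolder(int* p) : ptr_(p) {}
  int& operator*() const { return *ptr_; }

 private:
  int* ptr_;
};
```

Compiling both `operator*` bodies at -O0 and at -O3 and comparing the generated assembly is one way to check on a given compiler whether the extra call survives.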

All in all, using the current implementation, shared_ptr is slower on creation, moving, copying and reference counting, but just as fast *on dereferencing*.

NOTE: print() is empty in the question and the compiler omits the loops when optimized. So, I slightly changed the code to correctly observe the optimization results:

#include <iostream>
#include <string>
#include <memory>
#include <chrono>
#include <vector>

using namespace std;

class Print {
 public:
  void print() { i++; }

  int i{ 0 };
};

void test() {
  typedef vector<shared_ptr<Print>> sh_vec;
  typedef vector<unique_ptr<Print>> u_vec;

  sh_vec shvec;
  u_vec uvec;

  // can't use initializer_list with unique_ptr
  for (int var = 0; var < 100; ++var) {
    shvec.push_back(make_shared<Print>());
    uvec.emplace_back(new Print());
  }

  //-------------test shared_ptr-------------------------
  auto time_sh_1 = std::chrono::system_clock::now();

  for (auto var = 0; var < 1000; ++var) {
    for (auto it = shvec.begin(), end = shvec.end(); it != end; ++it) {
      (*it)->print();
    }
  }

  auto time_sh_2 = std::chrono::system_clock::now();

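  // count() on the raw difference yields clock ticks, so convert explicitly
  cout << "test shared_ptr : "
       << chrono::duration_cast<chrono::microseconds>(time_sh_2 - time_sh_1)
              .count()
       << " microseconds." << endl;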

  //-------------test unique_ptr-------------------------
  auto time_u_1 = std::chrono::system_clock::now();

  for (auto var = 0; var < 1000; ++var) {
    for (auto it = uvec.begin(), end = uvec.end(); it != end; ++it) {
      (*it)->print();
    }
  }

  auto time_u_2 = std::chrono::system_clock::now();

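  // count() on the raw difference yields clock ticks, so convert explicitly
  cout << "test unique_ptr : "
       << chrono::duration_cast<chrono::microseconds>(time_u_2 - time_u_1)
              .count()
       << " microseconds." << endl;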
}

int main() { test(); }

NOTE: The extra ::std::get<0>() call in unoptimized builds is not a fundamental problem; it could easily be fixed by dropping the use of ::std::tuple in the current libstdc++ implementation.

Solution 3

You're not testing anything useful here.

What you are talking about: copy

What you are testing: iteration

If you want to test copy, you actually need to perform a copy. Both smart pointers should have similar performance when it comes to reading, because good shared_ptr implementations keep a local copy of the pointer to the object, so a read never has to go through the reference-count block.
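A small sketch of the difference between reading and copying, using only the standard `use_count()` accessor. The `Counts` struct and `observe_use_count` function are names invented for this example.

```cpp
#include <cassert>
#include <memory>

// Hypothetical record of the reference count at three points in time.
struct Counts {
  long before_read;
  long after_read;
  long after_copy;
};

inline Counts observe_use_count() {
  auto sp = std::make_shared<int>(7);
  Counts c{};
  c.before_read = sp.use_count();

  int value = *sp;  // a read: only the stored pointer is followed
  assert(value == 7);
  c.after_read = sp.use_count();

  std::shared_ptr<int> copy = sp;  // a copy: the reference count changes
  c.after_copy = sp.use_count();
  return c;
}
```

The count is unchanged by the read and only moves when a copy is made, which is why a loop that merely dereferences says nothing about copy overhead.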

EDIT:

Regarding the new elements:

It's not even worth talking about speed when using debug code, in general. If you care about performance, you will use release code (-O2 in general) and thus that's what should be measured, as there can be significant differences between debug and release code. Most notably, inlining of template code can seriously decrease the execution time.

Regarding the benchmark:

I would add another round of measurements: naked pointers. Normally, unique_ptr and naked pointers should have the same performance. That is worth checking, and it is not necessarily true in debug mode.
You might want to "interleave" the execution of the two batches or, if you cannot, take the average of each over several runs. As it is, if the computer slows down toward the end of the benchmark, only the unique_ptr batch is affected, which perturbs the measurement.
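One possible shape for such a benchmark, as a sketch only: the three batches (naked pointers, unique_ptr, shared_ptr) are run alternately and each is averaged over several runs. `Counter`, `time_pass`, `Averages` and `interleaved_benchmark` are hypothetical names, and the container size of 100 simply mirrors the question.

```cpp
#include <cassert>
#include <chrono>
#include <memory>
#include <vector>

struct Counter {
  int i = 0;
  void bump() { ++i; }
};

// Times `rounds` passes over `items`, calling bump() through each element.
// Works for any container of pointer-like elements.
template <typename Vec>
long long time_pass(Vec& items, int rounds) {
  const auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < rounds; ++r) {
    for (auto& p : items) p->bump();
  }
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
      .count();
}

// Average time per run, in nanoseconds, for each kind of pointer.
struct Averages {
  double raw;
  double unique;
  double shared;
};

inline Averages interleaved_benchmark(int runs, int rounds) {
  std::vector<std::unique_ptr<Counter>> uvec;
  std::vector<std::shared_ptr<Counter>> svec;
  std::vector<std::unique_ptr<Counter>> raw_owner;  // owns objects for rvec
  std::vector<Counter*> rvec;
  for (int n = 0; n < 100; ++n) {
    uvec.emplace_back(new Counter());
    svec.push_back(std::make_shared<Counter>());
    raw_owner.emplace_back(new Counter());
    rvec.push_back(raw_owner.back().get());
  }

  long long raw = 0, unique = 0, shared = 0;
  for (int run = 0; run < runs; ++run) {
    // Alternating the batches spreads any machine slowdown over all three.
    raw += time_pass(rvec, rounds);
    unique += time_pass(uvec, rounds);
    shared += time_pass(svec, rounds);
  }

  // Reading the counters afterwards keeps the work observable.
  assert(rvec.front()->i == runs * rounds);
  assert(uvec.front()->i == runs * rounds);
  assert(svec.front()->i == runs * rounds);

  return {raw / static_cast<double>(runs), unique / static_cast<double>(runs),
          shared / static_cast<double>(runs)};
}
```

A median or minimum over the runs may be more robust than the mean when a few runs are disturbed; the mean is used here only because it is what the text suggests.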

You might be interested in learning more from Neil: The Joy of Benchmarks, it's not a definitive guide, but it's quite interesting. Especially the part about forcing side-effects to avoid dead-code removal ;)
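As an illustration of forcing a side effect, here is one common approach: writing the result to a `volatile` variable. This is an assumption about what such a technique can look like, not a summary of the linked article; `g_sink` and `sum_through` are hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical global sink. Writes to a volatile object are observable
// behaviour, so the compiler has to keep the value that is stored.
volatile long long g_sink = 0;

// Sums the pointed-to values and publishes the result through the sink,
// so the loop cannot be discarded as dead code.
inline long long sum_through(const std::vector<std::unique_ptr<int>>& items) {
  long long total = 0;
  for (const auto& p : items) total += *p;
  g_sink = total;
  return total;
}
```

Placing a loop like this between the two clock reads gives the optimizer a reason to keep it, unlike a call to an empty `print()`.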

Also, be careful about how you measure. The resolution of your clock might be coarser than it appears. If the clock is refreshed only every 15us, for example, then any measurement around 15us is suspicious. This can be an issue when measuring release code (you might need to add a few more iterations to the loop).
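A quick way to sanity-check a clock is to compare its nominal tick length with the smallest step it actually reports. The sketch below does this with standard `<chrono>` facilities; `nominal_tick_ns` and `observed_step_ns` are hypothetical helper names.

```cpp
#include <cassert>
#include <chrono>

// Nominal tick length of Clock in nanoseconds, taken from Clock::period.
// This is only what the type promises, not what the hardware delivers.
template <typename Clock>
double nominal_tick_ns() {
  using Period = typename Clock::period;
  return 1e9 * static_cast<double>(Period::num) /
         static_cast<double>(Period::den);
}

// Smallest step, in nanoseconds, observed between two successive distinct
// readings of Clock, over `samples` attempts.
template <typename Clock>
long long observed_step_ns(int samples) {
  assert(samples > 0);
  long long smallest = -1;
  for (int s = 0; s < samples; ++s) {
    const auto t0 = Clock::now();
    auto t1 = Clock::now();
    while (t1 == t0) t1 = Clock::now();  // spin until the clock advances
    const long long step =
        std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    if (smallest < 0 || step < smallest) smallest = step;
  }
  return smallest;
}
```

If `observed_step_ns<std::chrono::system_clock>(100)` comes out much larger than the nominal tick, measurements near that size should not be trusted and the loop count should be raised. `std::chrono::steady_clock` is usually the safer choice for intervals because it is never adjusted backwards.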

Author by

codablank1

Updated on June 23, 2022

Question

codablank1

I often read that unique_ptr should be preferred over shared_ptr in most situations, because unique_ptr is non-copyable and has move semantics, while shared_ptr adds overhead due to copying and ref-counting.

But when I test unique_ptr in some situations, it appears to be noticeably slower (in access) than its counterpart.

For example, under gcc 4.5:

Edit: the print method doesn't actually print anything.

#include <iostream>
#include <string>
#include <memory>
#include <chrono>
#include <vector>

using namespace std;  // vector, shared_ptr, cout, etc. are used unqualified below

class Print{

public:
void print(){}

};

void test()
{
 typedef vector<shared_ptr<Print>> sh_vec;
 typedef vector<unique_ptr<Print>> u_vec;

 sh_vec shvec;
 u_vec  uvec;

 //can't use initializer_list with unique_ptr
 for (int var = 0; var < 100; ++var) {

    shared_ptr<Print> p(new Print());
    shvec.push_back(p);

    unique_ptr<Print> p1(new Print());
    uvec.push_back(move(p1));

  }

 //-------------test shared_ptr-------------------------
 auto time_sh_1 = std::chrono::system_clock::now();

 for (auto var = 0; var < 1000; ++var) 
 {
   for(auto it = shvec.begin(), end = shvec.end(); it!= end; ++it)
   {
     (*it)->print();
   }
 }

 auto time_sh_2 = std::chrono::system_clock::now();

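 // count() on the raw difference yields clock ticks, so convert explicitly
 cout <<"test shared_ptr : "<< std::chrono::duration_cast<std::chrono::microseconds>(time_sh_2 - time_sh_1).count() << " microseconds." << endl;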

 //-------------test unique_ptr-------------------------
 auto time_u_1 = std::chrono::system_clock::now();

 for (auto var = 0; var < 1000; ++var) 
 {
   for(auto it = uvec.begin(), end = uvec.end(); it!= end; ++it)
   {
     (*it)->print();
   }
 }

 auto time_u_2 = std::chrono::system_clock::now();

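 // count() on the raw difference yields clock ticks, so convert explicitly
 cout <<"test unique_ptr : "<< std::chrono::duration_cast<std::chrono::microseconds>(time_u_2 - time_u_1).count() << " microseconds." << endl;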

}

On average I get (g++ -O0) :