Efficient 4x4 matrix vector multiplication with SSE: horizontal add and dot product - what's the point?

Horizontal add and dot product instructions are complex: they are decomposed into multiple simpler microoperations, which the processor then executes just like simple instructions. The exact decomposition is processor-specific, but on recent Intel processors a horizontal add decomposes into 2 SHUFFLE + 1 ADD microoperations, and a dot product into 1 MUL + 1 SHUFFLE + 2 ADD microoperations. Besides the larger number of microoperations, these instructions also stress the instruction decoder in the processor pipeline: Intel processors can decode only one such complex instruction per cycle (compared to four simple instructions). On AMD Bulldozer the relative cost of these complex instructions is even higher.
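
To make the decomposition concrete, here is a minimal sketch (my addition, not from the answer; the helper name hsum_ps is assumed) of a horizontal sum written out as the explicit shuffle and add steps that an instruction like haddps is broken into:

    #include <xmmintrin.h>  /* SSE shuffle/add intrinsics */

    /* Sum the four lanes of v using explicit shuffles and adds --
       roughly the work a horizontal-add-based reduction hides. */
    static inline float hsum_ps(__m128 v) {
      __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); /* swap lane pairs */
      __m128 sums = _mm_add_ps(v, shuf);                           /* pairwise partial sums */
      shuf = _mm_movehl_ps(shuf, sums);                            /* move high pair down */
      sums = _mm_add_ss(sums, shuf);                               /* final scalar add */
      return _mm_cvtss_f32(sums);                                  /* total sits in lane 0 */
    }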


Comments

  • Admin
    Admin almost 2 years

    I am trying to find the most efficient implementation of 4x4 matrix (M) multiplication with a vector (u) using SSE. I mean Mu = v.

    As far as I understand there are two primary ways to go about this:

        method 1) v1 = dot(row1, u), v2 = dot(row2, u), v3 = dot(row3, u), v4 = dot(row4, u)
        method 2) v = u1 col1 + u2 col2 + u3 col3 + u4 col4.
    

    Method 2 is easy to implement in SSE2. Method 1 can be implemented with either the horizontal add instruction in SSE3 or the dot product instruction in SSE4. However, in all my tests method 2 always outperforms method 1.

    One place where I thought method 1 would have an advantage is in a 3x4 matrix, for example for an affine transform. In this case the last dot product is unnecessary. But even in this case method 2 on a 4x4 matrix is faster than method 1 on a 3x4 matrix. The only method I have found that is faster than method 2 on a 4x4 matrix is method 2 on a 4x3 matrix.

    So what's the point of the horizontal add and dot product instructions? In fact, the dot product instruction gives the worst performance in this case. Maybe it has something to do with the format of the data? If one can't define how the matrix is ordered, then a transpose is necessary, and in that case maybe method 1 would be better?

    See below for some code.

    #include <smmintrin.h>  /* SSE4.1 header; also pulls in the SSE/SSE2/SSE3 intrinsics used below */

    /* Method 2: broadcast each element of v, scale the matching column,
       and sum the four scaled columns. Pure SSE2, no horizontal operations. */
    __m128 m4x4v_colSSE(const __m128 cols[4], const __m128 v) {
      __m128 u1 = _mm_shuffle_ps(v,v, _MM_SHUFFLE(0,0,0,0));
      __m128 u2 = _mm_shuffle_ps(v,v, _MM_SHUFFLE(1,1,1,1));
      __m128 u3 = _mm_shuffle_ps(v,v, _MM_SHUFFLE(2,2,2,2));
      __m128 u4 = _mm_shuffle_ps(v,v, _MM_SHUFFLE(3,3,3,3));

      __m128 prod1 = _mm_mul_ps(u1, cols[0]);
      __m128 prod2 = _mm_mul_ps(u2, cols[1]);
      __m128 prod3 = _mm_mul_ps(u3, cols[2]);
      __m128 prod4 = _mm_mul_ps(u4, cols[3]);

      return _mm_add_ps(_mm_add_ps(prod1, prod2), _mm_add_ps(prod3, prod4));
    }
    
    /* Method 1 with SSE3: four elementwise row*v products, then two rounds
       of horizontal adds to reduce each product to a single lane. */
    __m128 m4x4v_rowSSE3(const __m128 rows[4], const __m128 v) {
      __m128 prod1 = _mm_mul_ps(rows[0], v);
      __m128 prod2 = _mm_mul_ps(rows[1], v);
      __m128 prod3 = _mm_mul_ps(rows[2], v);
      __m128 prod4 = _mm_mul_ps(rows[3], v);

      return _mm_hadd_ps(_mm_hadd_ps(prod1, prod2), _mm_hadd_ps(prod3, prod4));
    }
    
    /* Method 1 with SSE4: each dpps computes one dot product and, with mask
       0xFF, broadcasts it to all four lanes; the final shuffle gathers the
       four results into one vector. */
    __m128 m4x4v_rowSSE4(const __m128 rows[4], const __m128 v) {
      __m128 prod1 = _mm_dp_ps(rows[0], v, 0xFF);
      __m128 prod2 = _mm_dp_ps(rows[1], v, 0xFF);
      __m128 prod3 = _mm_dp_ps(rows[2], v, 0xFF);
      __m128 prod4 = _mm_dp_ps(rows[3], v, 0xFF);

      return _mm_shuffle_ps(_mm_movelh_ps(prod1, prod2), _mm_movelh_ps(prod3, prod4),
                            _MM_SHUFFLE(2, 0, 2, 0));
    }
    
  • Admin
    Admin about 11 years
    Thanks, that explains why the instructions are slow. However, it does not explain why they were implemented. But I think I know now. Method 2 requires the data to be a struct of arrays (SoA), i.e. column ordered, to be optimal. If the data is an array of structs (AoS), i.e. row ordered, a transpose has to be done first, and in that case method 1 is much faster. In other words, if you can define the data layout, make it SoA instead of AoS and use method 2. Otherwise use method 1 with horizontal adds. Don't use the dot product instruction for matrix multiplication.
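
    To illustrate the transpose cost this comment refers to, here is a sketch (my addition; the function name m4x4v_rowsViaTranspose is assumed) that converts row-ordered AoS data to columns with the stock _MM_TRANSPOSE4_PS macro before applying method 2. The macro expands to roughly eight shuffle-type operations, which is the overhead that makes method 1 the better choice for AoS data.

    #include <xmmintrin.h>  /* SSE intrinsics and _MM_TRANSPOSE4_PS */

    /* Method 2 applied to row-ordered data: transpose first, then
       broadcast-multiply-accumulate exactly as in m4x4v_colSSE. */
    __m128 m4x4v_rowsViaTranspose(const __m128 rows[4], __m128 v) {
      __m128 c0 = rows[0], c1 = rows[1], c2 = rows[2], c3 = rows[3];
      _MM_TRANSPOSE4_PS(c0, c1, c2, c3);  /* rows -> columns, ~8 shuffles */

      __m128 r = _mm_mul_ps(_mm_shuffle_ps(v, v, _MM_SHUFFLE(0,0,0,0)), c0);
      r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(v, v, _MM_SHUFFLE(1,1,1,1)), c1));
      r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(v, v, _MM_SHUFFLE(2,2,2,2)), c2));
      r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(v, v, _MM_SHUFFLE(3,3,3,3)), c3));
      return r;
    }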
  • Chuck Walbourn
    Chuck Walbourn over 9 years
    CPU vendors have a history of adding new instructions that could be very useful, but initially dedicating very little hardware to implementing them. If they get adopted by enough programs, then they eventually add more hardware to actually make the instruction faster. First-generation _mm_dp_ps isn't really any faster than the usual SSE or SSE3 approach to doing this, although in theory it should be a bit less code-bloat if you are doing a lot of them.
  • St0fF
    St0fF over 7 years
    If you look at the Intel Intrinsics Guide (link), you can see the performance figures. They should also help explain why the dp solution is outperformed by far even by the hadd solution.