Using Knights Corner (Xeon Phi) Gather Intrinsics with Fortran complex types

16th October 2014

I’ve been porting some code to the Intel Xeon Phi lately. It’s an exciting many-core device with a highly parallel memory architecture and very high bandwidth (about 180 GB/s sustained on a 7120P device). The current micro-architecture is called Knights Corner (KNC); the next release, Knights Landing, is due sometime next year with even more exciting features, including stacked DRAM technology.

We wrote a case study about our experience with a Geophysical application here (link to RSI case study).

Generally speaking, to get good performance out of the Xeon Phi device you need to make sure that your application meets at least the following criteria (although in practice there are notable exceptions to these rules):

1. It must scale to > 100 threads
2. It should be able to utilise the wide SIMD vector units

I like to re-write code hotspots using intrinsic functions where possible. This makes for an interesting problem if your original application is in Fortran, as intrinsic functions are only available in C compilers (both icc and gcc support them for a lot of Intel CPUs).
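
To give a feel for how the two languages meet, here is a minimal sketch (with invented names, not the actual code): the hotspot is compiled as a C function and the Fortran side calls it through ISO_C_BINDING.

#include <stdint.h>

/* C entry point for the kernel, compiled separately with icc.
 * A matching Fortran declaration would look something like:
 *
 *   interface
 *     subroutine spmv_kernel(vals, idx, x, y, n) bind(C, name="spmv_kernel")
 *       use iso_c_binding
 *       complex(c_double_complex) :: vals(*), x(*), y(*)
 *       integer(c_int32_t)        :: idx(*)
 *       integer(c_int32_t), value :: n
 *     end subroutine
 *   end interface
 */
void spmv_kernel(const double *vals, const int32_t *idx,
                 const double *x, double *y, int32_t n)
{
    // intrinsics implementation goes here; Fortran complex(8) arrays
    // arrive as interleaved {re,im} pairs of doubles
}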

For a particular problem I encountered recently, I was porting a sparse matrix-vector multiplication (SpMV) kernel from Fortran 90 to Xeon Phi intrinsics in C, to make sure we met criterion 2 effectively. The sparse matrix code contains a “gather” operation at its heart, where elements from the sparse matrix are loaded based on indices held in another vector.
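
In scalar terms, the gather at the heart of the kernel looks something like this (a C sketch with illustrative names, standing in for the original Fortran):

#include <stdint.h>

// Scalar gather: 'vals' holds Fortran complex doubles as interleaved
// {re,im} pairs, 'idx' holds 1-based indices into that complex array.
static void gather_complex(const double *vals, const int32_t *idx,
                           double *re, double *im, int n)
{
    for (int k = 0; k < n; k++) {
        int32_t i = idx[k];      // 1-based complex index
        re[k] = vals[2*i - 2];   // real part at double offset 2(i-1)
        im[k] = vals[2*i - 1];   // imaginary part at offset 2(i-1)+1
    }
}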

On Intel host CPUs, since at least SSE2, loading sparse elements into a SIMD register has required loading them into scalar registers and repacking them into an SSE/AVX register via the L1 cache. The Knights Corner architecture in the current Xeon Phi supports a new type of gather instruction, which packs sparse elements into a ZMM register based on a register holding a vector of indices.
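
On the host, that repacking amounts to something like the following SSE2 sketch (_mm_set_pd compiles down to scalar loads plus a shuffle):

#include <emmintrin.h>
#include <stdint.h>

// Pre-gather "gather": two scalar loads repacked into one XMM register.
// Here we pack the real parts of two sparse complex elements (1-based indices).
static inline __m128d gather2_real(const double *vals, const int32_t *idx)
{
    return _mm_set_pd(vals[2*idx[1] - 2],   // high element
                      vals[2*idx[0] - 2]);  // low element
}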

The example below shows how to load sparse complex doubles from Fortran using the KNC gather intrinsics. Given 8 indices into an array of Fortran complex doubles, it packs 16 doubles into 2 separate vector registers. One of the tricky aspects is that Fortran has native support for complex variables, so to gather complex elements we first need to transform the original set of indices into a set of indices for the individual real and imaginary components.

#include <immintrin.h> // KNC intrinsics; compile with icc -mmic
#include <stdint.h>

// permutation patterns and constants used below
const __m512i evens = {0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15};
const __m512i odds = {8,0,9,1,10,2,11,3,12,4,13,5,14,6,15,7};
const __m512i zeros = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
const __m512i one = {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1};
const __m512i two = {2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2};
const __m512d dzeros = {0,0,0,0,0,0,0,0};

// index is the int32_t* to our sparse indices
__m512i indr = _mm512_load_epi32(index);
__m512i gatherind;
__m512i exp,exp1,indr1;

__mmask16 blender = _mm512_int2mask(43690); // 0xAAAA: selects the odd lanes
// convert F90 complex indices (1-based {1,3,5,6}, i.e. 0-based {0,2,4,5})
// -> double offsets {0,1,4,5,8,9,10,11} for the real/imaginary components
// multiply by two
indr = _mm512_mullo_epi32(indr,two);
// subtract one for Fortran one-based indices: indr1 now holds the
// offsets of the imaginary parts (2i-1)
indr1 = _mm512_sub_epi32(indr,one);
// subtract one again: indr now holds the offsets of the real parts (2i-2)
indr = _mm512_sub_epi32(indr1,one);

// interleave the real-part offsets into the even lanes ...
exp = _mm512_permutevar_epi32(evens,indr);
// ... and the imaginary-part offsets into the odd lanes
exp1 = _mm512_permutevar_epi32(odds,indr1);

// blend: even lanes (real offsets) from exp, odd lanes (imaginary offsets) from exp1
gatherind = _mm512_mask_blend_epi32(blender,exp,exp1);

// now do the gather
__mmask16 gathermask;
__mmask8 gathermask8;

// keep only lanes with a non-negative offset (an original index of 0,
// e.g. padding, comes out as offset -2/-1 and is masked off here)
gathermask = _mm512_cmp_epi32_mask(zeros,gatherind,_MM_CMPINT_LE);
gathermask8 = (__mmask8)gathermask;

// sparse is the double * to the start of our sparse matrix
// gather using the low 8 indexes from gatherind
__m512d sparse1 = _mm512_mask_i32logather_pd(dzeros,gathermask8,gatherind,sparse,8);

// get the high byte of the mask
gathermask8 = (__mmask8)_mm512_kswapb(gathermask,gathermask);

// rotate the gatherind to get the high 8 in the low 8
//gatherind = _mm512_permute4f128_epi32(gatherind,14);
gatherind = _mm512_alignr_epi32(gatherind,gatherind,8);

__m512d sparse2 = _mm512_mask_i32logather_pd(dzeros,gathermask8,gatherind,sparse,8);
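
For reference, here is a scalar sketch of what the whole sequence computes, using the same index and sparse pointers (and assuming, as the mask logic suggests, that an index value of 0 marks an unused lane):

// Scalar equivalent of the gather sequence above, handy for verification.
double packed[16];
for (int k = 0; k < 8; k++) {
    int32_t i = index[k];                              // 1-based complex index
    packed[2*k]     = (i > 0) ? sparse[2*i - 2] : 0.0; // real part
    packed[2*k + 1] = (i > 0) ? sparse[2*i - 1] : 0.0; // imaginary part
}
// packed[0..7] corresponds to sparse1, packed[8..15] to sparse2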

Based on this technique as well as a host of other micro-optimisations, we were able to achieve a ~3x performance improvement for this SpMV operation over the code generated by the Fortran compiler, without touching the rest of the code base.


2 thoughts on “Using Knights Corner (Xeon Phi) Gather Intrinsics with Fortran complex types”

  1. How long did it take you to figure this out? Like all intrinsics code I’ve seen, this looks very difficult to read and maintain!

    I’ve written some code too, but it only uses more basic SSE instructions. When I did this I noticed that pipeline stalls became very significant. I needed a speedup of 4x: each SSE instruction processed 8 elements in parallel, yet I achieved practically no speedup.

    The problem appeared to be pipeline bubbles in the processor: each SSE instruction depended on the last. I found that “interleaving” the instructions by a factor of 4 reduced the pipeline bubbles sufficiently to achieve the desired speedup. However, there is now 4x the amount of SSE intrinsics, and it’s even harder to read and maintain!
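
    Roughly, the interleaved version looks like this (a simplified sketch, not my actual code):

    #include <emmintrin.h>

    // Four independent accumulators break the instruction-to-instruction
    // dependency chain, so the pipeline stays fed.
    double sum_interleaved(const double *a, int n) // n a multiple of 8
    {
        __m128d acc0 = _mm_setzero_pd(), acc1 = _mm_setzero_pd();
        __m128d acc2 = _mm_setzero_pd(), acc3 = _mm_setzero_pd();
        for (int i = 0; i < n; i += 8) {
            acc0 = _mm_add_pd(acc0, _mm_loadu_pd(a + i));
            acc1 = _mm_add_pd(acc1, _mm_loadu_pd(a + i + 2));
            acc2 = _mm_add_pd(acc2, _mm_loadu_pd(a + i + 4));
            acc3 = _mm_add_pd(acc3, _mm_loadu_pd(a + i + 6));
        }
        acc0 = _mm_add_pd(_mm_add_pd(acc0, acc1), _mm_add_pd(acc2, acc3));
        double out[2];
        _mm_storeu_pd(out, acc0);
        return out[0] + out[1];
    }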

    I wonder if this could be used to make your code even faster, since you have a similar problem to mine (each instruction depends on the last). Balanced against that, I suppose, is the question of whether the RAM can keep up no matter how fast you get the CPUs to go. After I created my SSE implementation I tried multi-threading it, only to find the system couldn’t get the data out of system RAM and into multiple CPUs quickly enough.

    1. Hello Matt!

      I agree with you: intrinsic-based code has an impact on maintainability and also portability. Sometimes the performance benefits are worth the effort. In this case, the application originally required 2 weeks to find a solution; with this change, and quite a lot of other intrinsic code to implement the SpMV kernel, that reduced to about 4 days. The rest of the application code is Fortran, but in order to use intrinsics you need to use the C compiler.

      Regarding pipeline stalls, it isn’t really an issue with this code. The code above is almost completely memory-latency bound due to the sparse accesses. In fact, the entire kernel of this SpMV operation is memory-latency bound. However, it is not quite that simple for Knights Corner.

      Even though this code is memory-latency bound on a regular Xeon and also on the Xeon Phi, you still need to vectorise the code explicitly and otherwise try to reduce the latency of the computational kernel (including interleaving dependent instructions to reduce pipeline stalls). There are several contributing reasons for this. Firstly, the Xeon Phi compiler is not quite as good (yet) as the Xeon compiler at generating optimal vectorised code. For high performance on KNC you must use the 512-bit SIMD registers, and if the compiler doesn’t work it out, then you need intrinsics/assembly. Secondly, and most importantly, the Xeon Phi is an in-order architecture, so each core cannot analyse instruction dependencies and re-schedule them to reduce latency. This also means that explicitly prefetching your data at the appropriate distance makes a big difference compared to a Xeon CPU. I wrote a separate test program to determine the optimal prefetch distances into L1 and L2.
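
      As a rough sketch of what the prefetching looks like (the distances here are placeholders; the real values came out of that test program):

      #include <stdint.h>
      #include <xmmintrin.h> // _mm_prefetch

      enum { PF_L1 = 8, PF_L2 = 64 }; // placeholder distances, tuned empirically

      void spmv_with_prefetch(const int32_t *index, int32_t nnz)
      {
          for (int32_t k = 0; k < nnz; k += 8) {
              _mm_prefetch((const char *)&index[k + PF_L2], _MM_HINT_T1); // into L2
              _mm_prefetch((const char *)&index[k + PF_L1], _MM_HINT_T0); // into L1
              // ... gather and multiply-accumulate for the 8 indices at k ...
          }
      }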

      Despite all of these little intricacies, the Xeon Phi can give exceptional performance when the application is designed correctly. If you have highly parallel memory accesses, massive floating-point computational density, or code that is memory-bandwidth bound, then it is quite possible you will get good performance on the Xeon Phi.
