Why SpMV works so well on the Xeon Phi

4th March 2015

It turns out that Sparse Matrix-Vector Multiplication (SpMV) on the Intel Xeon Phi can be quite fast.

Generally, SpMV is limited by DRAM latency rather than bandwidth, because the matrix's column indices drive a stream of effectively random accesses into the source vector. DRAM design and modern cache architectures mean that random-access performance is a very small fraction of peak: low single-digit GB/s versus ~60 GB/s for sequential accesses on a modern processor with four DRAM channels per socket.
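
To make that access pattern concrete, here is a minimal scalar SpMV sketch in C (shown in CSR form for familiarity; the EMGeo kernel discussed below uses an ELLPACK-based format). All names here are illustrative, not from any real codebase:

    /* y = A*x for a CSR matrix. The read x[col[j]] is driven by the
     * matrix's column indices, so it is effectively random -- this is
     * the latency-bound part; val, col and rowptr stream sequentially. */
    void spmv_csr(int nrows, const int *rowptr, const int *col,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
                sum += val[j] * x[col[j]];  /* indirect gather from x */
            y[i] = sum;
        }
    }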

However, SpMV is embarrassingly (*blushes*) parallel, so a memory architecture that can serve many accesses concurrently can hide much of that latency. On the Xeon Phi (Knights Corner at the time of writing) there are 16 independent memory channels with 16 banks of GDDR on each channel. This means that under ideal conditions (well-distributed addresses) an application can potentially access 256 memory banks in parallel.

I recently worked on a project with Rock Solid Images to implement the EMGeo geophysical software on the Xeon Phi. EMGeo spends most of its time doing SpMV, and we suspected it could run faster if ported to the Xeon Phi architecture. The SpMV kernel inside EMGeo uses a hybrid storage format based on ELLPACK, optimised for the sparse matrix structure required for EM inversion.
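
For reference, a minimal sketch of what an ELLPACK-style layout looks like in C; the type and field names are my own illustration, not EMGeo's actual structures:

    /* ELLPACK-style storage (illustrative). Every row is padded to the
     * same width, so values and column indices become dense
     * nrows-by-width arrays that map cleanly onto wide vector registers. */
    typedef struct {
        int     nrows;
        int     width;  /* maximum non-zeros per row                */
        double *val;    /* nrows * width values, zero-padded        */
        int    *col;    /* nrows * width indices; padded entries
                           point at a dummy element of x            */
    } ell_matrix;

The padding wastes a little memory, but it buys fixed-length rows, which is exactly what vector hardware wants.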

Figure 1. Simplified diagram showing how SpMV on the Xeon Phi utilises masked execution and gather capabilities

This SpMV kernel is part of a larger Quasi-Minimal Residual (QMR) solver for iteratively solving the EM equations. The number of non-zeros per row in the sparse matrix is between 5 and 13, which fits neatly inside the Xeon Phi's 512-bit vector registers. The masked execution capabilities of the KNC instruction set mean that all of the conditional execution per row can be handled at the register level without branching.

Figure 1 above illustrates the basic process of SpMV on the Xeon Phi: conditional vectorised execution via mask registers, combined with the hardware vector gather functionality.
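
As a rough illustration of Figure 1, here is one row's worth of work expressed with masked loads and a masked gather. I have written it with the AVX-512 intrinsic names for readability; the KNC originals differ slightly (the D-index gather, for instance, is _mm512_mask_i32logather_pd), and the real kernel works on complex values, so treat this purely as a sketch:

    #include <immintrin.h>

    /* Dot product of one padded row (up to 8 doubles) against x.
     * The mask covers the row's real non-zeros; rows wider than 8
     * would need a second masked pass. All names are illustrative. */
    static double spmv_row(const double *val, const int *col,
                           const double *x, int nnz)
    {
        __mmask8 m   = (__mmask8)((1u << nnz) - 1);        /* lanes 0..nnz-1 */
        __m256i  idx = _mm256_loadu_si256((const __m256i *)col); /* D-indices */
        __m512d  a   = _mm512_loadu_pd(val);               /* padded row     */
        __m512d  xv  = _mm512_mask_i32gather_pd(_mm512_setzero_pd(),
                                                m, idx, x, 8);
        __m512d  p   = _mm512_mul_pd(a, xv);    /* masked lanes contribute 0 */
        return _mm512_reduce_add_pd(p);
    }

Note the absence of any branch: the conditional "which lanes are real?" logic lives entirely in the mask register.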

Kernel Development and Optimisation

I isolated the SpMV kernel from the rest of the existing Fortran code and wrote a new version in C using KNC intrinsics (Fortran does not support vector intrinsics).

I then applied the following optimisations to the kernel:

  • Aligned accesses everywhere through padding
  • Register tiling
  • Explicitly vectorised complex computation (I talked a little about this in another post here)
  • Masked execution throughout to eliminate branches
  • Loop peeling for remainders
  • Loop splitting for improved TLB performance
  • Streaming stores where appropriate
  • D-index (32-bit) gathers (don't use Q-index gathers!)
  • Software prefetching, with distances tuned through an outer optimisation loop (see the sketch after this list)
  • Explicit early cache line evictions

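To give a flavour of the prefetching item above, here is a hedged sketch of software prefetching in the row loop. PF_DIST and all array names are assumptions for illustration; in the real kernel the prefetch distance was tuned empirically, and on KNC the consumed cache lines were additionally evicted early (the _mm_clevict intrinsic):

    #include <xmmintrin.h>  /* _mm_prefetch */

    #define PF_DIST 4  /* rows ahead -- a tuning parameter, not a magic number */

    /* ELLPACK row loop with software prefetch (scalar body for brevity). */
    void spmv_ell(int nrows, int width, const double *val, const int *col,
                  const double *x, double *y)
    {
        for (int i = 0; i < nrows; i++) {
            if (i + PF_DIST < nrows) {  /* pull a future row towards L1 */
                _mm_prefetch((const char *)&val[(size_t)(i + PF_DIST) * width],
                             _MM_HINT_T0);
                _mm_prefetch((const char *)&col[(size_t)(i + PF_DIST) * width],
                             _MM_HINT_T0);
            }
            double sum = 0.0;
            for (int j = 0; j < width; j++)
                sum += val[(size_t)i * width + j]
                     * x[col[(size_t)i * width + j]];
            y[i] = sum;
        }
    }
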
With all of the optimisations above, the SpMV kernel ran ~2.8x faster than the original Fortran code on a single Xeon Phi core.

I integrated the kernel back into the rest of the existing Fortran code and tested it on an MPI cluster of 32 Xeon Phi 5110P cards.

The results were really good: for the equivalent hardware cost at the time of testing (the comparison is likely even more favourable now!), we saw a 3x improvement in the overall runtime of EMGeo.

Conclusions

  • SpMV can be fast on the Xeon Phi
  • It can be easy to integrate if you have existing C or Fortran code
  • If you get into optimising the hot kernel in depth, you can get really good results
