BLAST on Xeon Phi

9th February 2015

As detailed in a previous blog post, offloading the bottleneck code of NCBI blastp to a Virtex-7 FPGA platform can achieve an overall speedup of 80x versus a single 3.2GHz Core i5 thread, which translates to 10x in a server-to-server comparison. Although the FPGA implementation is promising, it involves a long development and verification process. An alternative acceleration technique is to adopt the latest Intel Xeon Phi coprocessor, which can yield greater performance on highly parallel applications than general-purpose Intel Xeon CPUs of comparable cost and thermal design power.

Because the NCBI code is already highly optimized for multi-threading, our preliminary test was to run blastp natively on a single Xeon Phi. This showed very poor performance, particularly when searching a large protein database. We found significant disk I/O overhead when a Xeon Phi reads the database files from the host over an NFS mount. Furthermore, blastp performance does not scale well across multiple Xeon Phis, mainly because of a highly imbalanced workload across the host processors and coprocessors.

To address these issues, we have developed an approach that combines the following schemes:

  1. Make full use of all processors and coprocessors.
  2. Balance the load by segmenting the workload into small jobs.
  3. Take advantage of the host page cache to reduce the data read overhead.
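
As a rough illustration of why the second scheme helps, the Python sketch below simulates workers of unequal speed under two policies: one equal share per worker, and a shared queue of small jobs that each worker pulls from as it becomes free. The function names, the relative speeds and the job size of 50 queries are hypothetical values chosen only for illustration; the 5550-query total is the query size used for the runtime figures. This is a toy model, not the scheduler used in our system.

```python
import heapq


def static_makespan(n_queries, speeds):
    """Give every worker an equal share; the slowest worker sets the finish time."""
    share = n_queries / len(speeds)
    return max(share / s for s in speeds)


def dynamic_makespan(n_queries, speeds, job_size):
    """Workers repeatedly take the next small job from a shared queue."""
    # Full-size jobs, plus one remainder job if the total does not divide evenly.
    jobs = [job_size] * (n_queries // job_size)
    if n_queries % job_size:
        jobs.append(n_queries % job_size)

    # Min-heap of (time at which the worker becomes free, worker index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)

    finish = 0.0
    for size in jobs:
        t, i = heapq.heappop(free_at)      # earliest available worker
        done = t + size / speeds[i]        # time it completes this job
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish


# Hypothetical relative throughputs: four coprocessors and one faster host.
speeds = [1.0, 1.0, 1.0, 1.0, 2.0]
static_time = static_makespan(5550, speeds)
dynamic_time = dynamic_makespan(5550, speeds, job_size=50)
print("static split :", static_time)
print("small jobs   :", dynamic_time)
```

With an equal split, the faster worker finishes early and sits idle while the slower ones are still running. With small jobs, every worker stays busy until the queue is nearly empty, so the finish time can exceed the ideal by at most the duration of one job.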

The performance figures are listed in the table below. The heterogeneous system (AE Appliance) with four Xeon Phis and a dual Xeon E5 achieves a speedup of more than 6x over the baseline commercial platform, which typically uses 8 threads of a dual Xeon E5 to run a blastp instance.

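Platform | Processors | Runtime [1] | Speedup
--- | --- | --- | ---
Commercial platform (90GB RAM) | Dual Xeon E5-2450 v2 @ 2.1GHz (8 threads) | 28h | 1x
AE Appliance (64GB RAM) | Four Xeon Phi 31S1 [2] | 8h38m | 3.24x
AE Appliance (64GB RAM) | Four Xeon Phi 31S1 + Dual Xeon E5-2690 v3 @ 2.60GHz (48 threads) [2] | 4h38m | 6.04x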
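
The speedup figures follow directly from the runtimes: each is the 28h baseline divided by the runtime of the configuration in question. The short Python check below reproduces them; the helper name `to_minutes` and the dictionary labels are ours, introduced only for this illustration.

```python
def to_minutes(hours, minutes=0):
    """Convert an h/m runtime from the table into minutes."""
    return hours * 60 + minutes


baseline = to_minutes(28)  # commercial platform, 8 threads

runtimes = {
    "Four Xeon Phi 31S1": to_minutes(8, 38),
    "Four Xeon Phi 31S1 + dual Xeon E5-2690 v3": to_minutes(4, 38),
}

# Speedup = baseline runtime / accelerated runtime, rounded to two decimals.
speedups = {name: round(baseline / t, 2) for name, t in runtimes.items()}

for name, s in speedups.items():
    print(f"{name}: {s}x")
```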

These results are encouraging. We hope to further scale up single-server performance by using more advanced hardware to run blastp on 8 Xeon Phis within a single node, and to develop an optimized offload architecture for easy scale-out across multiple nodes.

Please feel free to contact us for more details.

[1] The runtime figures are based on a protein database with 80M sequences (38 GB) and a query with 5550 sequences.
[2] All Xeon Phis use a maximum of 224 threads.
