9th February 2015
As detailed in a previous blog post, it is possible to achieve an overall speedup of 80x (versus a single 3.2GHz Core i5 thread) by offloading the bottleneck code of NCBI blastp to a Virtex-7 FPGA platform. This translates to 10x in a server-to-server comparison. Although the FPGA implementation is promising, it involves a long development and verification process. An alternative acceleration technique is to adopt the latest Intel Xeon Phi coprocessor, which can yield greater performance on highly parallel applications than general-purpose Intel Xeon CPUs of comparable cost and thermal design power.
Because the NCBI code is already highly optimized for multi-threading, our preliminary test was to run blastp natively on a single Xeon Phi. This showed very poor performance, particularly when searching a large protein database. We found significant disk I/O overhead when the Xeon Phi reads database files stored on the host over an NFS mount. Furthermore, blastp performance does not scale well across multiple Xeon Phis, mainly because the workload is highly imbalanced between the host processors and the coprocessors.
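To make the imbalance problem concrete, here is a toy Python sketch. It is not the scheme used in our implementation. It only illustrates why splitting the query evenly across devices of unequal speed scales poorly, compared with letting each device pull the next chunk when it becomes free. The throughput figures, the chunk count and the function names are all hypothetical.

```python
import heapq


def static_makespan(num_chunks, rates):
    """Even split: every device receives the same number of query chunks."""
    per_device = num_chunks / len(rates)
    # The run finishes only when the slowest device is done.
    return max(per_device / rate for rate in rates)


def dynamic_makespan(num_chunks, rates):
    """Shared queue: whichever device is free first takes the next chunk."""
    # Min-heap of (time at which the device becomes free, device index).
    free_at = [(0.0, i) for i in range(len(rates))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_chunks):
        start, device = heapq.heappop(free_at)
        done = start + 1.0 / rates[device]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, device))
    return finish


# Hypothetical throughputs in chunks per hour (assumed, not measured):
# one fast host and four slower coprocessors.
RATES = [4.0, 1.0, 1.0, 1.0, 1.0]
CHUNKS = 80

print("even split :", static_makespan(CHUNKS, RATES), "hours")
print("shared queue:", dynamic_makespan(CHUNKS, RATES), "hours")
```

With these made-up rates the even split is held back by the slowest devices, while the shared queue keeps every device busy until the work runs out.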
To cope with the issues described above, we have developed a sophisticated approach to achieve the best performance, which involves several schemes as below:
The performance figures are listed in the table below. The heterogeneous system (AE Appliance), with four Xeon Phis and a dual Xeon E5, achieves a speedup of more than 6x over the baseline commercial platform, which typically uses 8 threads of a dual Xeon E5 to run a blastp instance.
| Platform | Processors | Runtime¹ | Speedup |
|---|---|---|---|
| Commercial platform (90GB RAM) | Dual Xeon E5-2450 v2 @ 2.1GHz (8 threads) | 28h | 1x |
| AE Appliance (64GB RAM) | Four Xeon Phi 31S1² | 8h38m | 3.24x |
| AE Appliance (64GB RAM) | Four Xeon Phi 31S1 + Dual Xeon E5-2690 v3 @ 2.60GHz (48 threads)² | 4h38m | 6.04x |
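The speedup column is simply the baseline runtime divided by each platform's runtime. As a quick check, the short Python sketch below recomputes it from the runtimes in the table. The helper names `to_hours` and `speedup` are ours, chosen for illustration.

```python
def to_hours(runtime: str) -> float:
    """Convert a runtime written as '28h' or '8h38m' into hours."""
    hours, _, rest = runtime.partition("h")
    minutes = rest.rstrip("m") or "0"
    return int(hours) + int(minutes) / 60


def speedup(baseline: str, runtime: str) -> float:
    """Speedup of a run relative to the baseline runtime."""
    return to_hours(baseline) / to_hours(runtime)


BASELINE = "28h"  # commercial platform, dual Xeon E5, 8 threads

for label, runtime in [
    ("Four Xeon Phi", "8h38m"),
    ("Four Xeon Phi + dual Xeon E5", "4h38m"),
]:
    print(f"{label}: {speedup(BASELINE, runtime):.2f}x")
```

This reproduces the 3.24x and 6.04x figures reported above.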
These results are encouraging. We hope to scale up single-server performance further by using more advanced hardware to run blastp on 8 Xeon Phis within a single node, and to develop an optimized offload architecture for easy scale-out across multiple nodes.
Please feel free to contact email@example.com for more details.
¹ The runtime figures are based on a protein database with 80M sequences (38 GB) and a query with 5550 sequences.
² All Xeon Phis use a maximum of 224 threads.