Accelerating Bioinformatics Pipelines: BIBM 2014

13th November 2014

I recently attended the IEEE BIBM conference hosted by the University of Ulster in Belfast. Given that not all my fans were able to make the journey to this part of the world, I thought I’d share the slides for those not lucky enough to be there.

img0

The presentation covered our view of the big, big world of bioinformatics and some of the challenges facing translation to the clinic, as well as two examples of work we have (very) recently carried out in the field.

img1

We achieve the above by analysing how data flows through your system, identifying the bottlenecks and applying software, hardware or a combination of both to address them. By optimising the performance of the system as a whole, we are able to deliver much greater performance gains than would be possible with a single-point solution.

img2

We are interested in the field of bioinformatics because of its growing data volumes and large computational requirements. On a personal level, I’m particularly interested in the knowledge we are gaining about ourselves and the natural world. I’d like to explore the first two reasons in more detail.

img3

The above graph (extracted the night before the conference, I hope you will note) shows the number of reads stored by the European Nucleotide Archive (ENA) over the last nine years. The archive currently holds around 1 Pbase of read data (1,000,000,000,000,000 bases), and this volume is doubling every 18 months. Large-scale genomics projects such as Genomics England’s 100k genome project look set to surpass this volume of data in a matter of months. My back-of-the-envelope calculation indicates that this project will need to store roughly 10 Pbase of reads over the next three years, ten times as much as the ENA currently holds.
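For the curious, here is a quick sketch of how that figure can be reached. The per-genome yield used below (around 30x coverage of a 3 Gbase human genome) is my own assumption for illustration, not a figure published by the project.

```python
# Back-of-the-envelope estimate of the 100k genome project's read volume.
# Coverage and genome size are illustrative assumptions.
genomes = 100_000        # planned whole genomes
genome_size = 3e9        # ~3 Gbase per human genome
coverage = 30            # assumed ~30x sequencing depth

total_bases = genomes * genome_size * coverage
print(f"~{total_bases / 1e15:.0f} Pbase")   # ~9 Pbase, i.e. roughly 10 Pbase
```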

Storing this amount of data is a hard problem to have, but it is one that is already being solved: Facebook was adding 7 PBytes of storage per month at the turn of 2013. I believe that the bigger problem is going to be transforming this data into useful information, i.e. performing computation on it.

img4

We’ve grown used to computational power doubling every 18 months, as described by Moore’s Law. The important thing to remember about Moore’s Law is that it isn’t actually a law, more a rule of thumb. The graph above shows the ‘law’ (green line) and some of its key drivers, each of which has hit a limit at some point during the last 40 years. One of the major drivers has been the continuing reduction in the size of transistors, which will eventually meet physical limits.

img5

The last couple of slides have demonstrated that the data volumes associated with bioinformatics will grow faster than available computational power over the next few years – the traditional approach of buying newer, faster servers every few years won’t work. One approach to overcoming this challenge is to ‘scale out’ computation, i.e. run analysis in parallel across multiple servers. However, data transfer technology is improving at a slower rate than computational power, so scaling out alone is unlikely to provide a long-term solution. What Analytics Engines intend to do is to make better use of the available computational power, matching pipelines more closely to the hardware and removing inefficiencies. In this way, we will move the processing line closer to the data line.

img6

Cost is massively important: health service decisions are determined by some form of cost-benefit analysis. The cheaper the diagnosis or analysis, the more likely it is to be adopted.

Volume is a barrier to adoption of new methods: any new approach needs to be available to all of the people who need it. Several speakers referred to the population of Northern Ireland as ‘small’. However, at around 1.8 million people it is still 18 times larger than the largest planned genomic study to date. It is important to note that cost needs to scale linearly (or better) with volume for large-scale clinical adoption to occur.

Privacy and data management need to be considered at every stage; not only do bioinformatic methods deal with extremely personal data, they are also used to make extremely important decisions. All analyses will need to be repeatable many years after clinical use, so that errors in decision making can be traced.

I will now discuss two case studies in which we have addressed the first two of these three areas for improvement.

img7

We have been working with Almac Diagnostics to improve the execution time of their unsupervised clustering pipeline, which is used as part of the company’s biomarker discovery process. Sinead Donegan presented the pipeline on the Industrial Track at BIBM, and Timothy Davison provided a keynote on Almac Diagnostics’ work in biomarker discovery. The execution time of this pipeline was the limiting factor on the volume of analysis that could be performed.

The pipeline uses the gap statistic to identify the optimum number of clusters and number of probe sets from microarray data.

img8

The gap statistic uses bootstrapping (the generation and clustering of random reference data sets) to produce a baseline against which the clustering of the original data set is compared. This is computationally expensive and consumes the vast majority of the pipeline’s run-time.
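To make the cost concrete, here is a minimal sketch of the gap statistic, assuming k-means as the clustering method and a uniform reference distribution; Almac’s actual pipeline, clustering method and parameters are not described here, so treat this purely as an illustration of where the bootstrapping expense comes from.

```python
# A sketch of the gap statistic with bootstrapped reference data sets.
# k-means and the uniform reference distribution are illustrative assumptions,
# not a description of the Almac Diagnostics pipeline.
import numpy as np
from sklearn.cluster import KMeans

def within_dispersion(X, labels):
    """Pooled within-cluster sum of squared distances to cluster centroids (W_k)."""
    return sum(
        ((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
        for c in np.unique(labels)
    )

def gap_statistic(X, k_max=10, n_refs=20, seed=0):
    """Return Gap(k) for k = 1..k_max; larger gaps indicate better-supported k."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        # Cluster the real data once per candidate k.
        log_wk = np.log(within_dispersion(X, KMeans(n_clusters=k, n_init=10).fit_predict(X)))
        # The expensive part: generate and cluster n_refs random data sets for every k.
        ref_log_wk = [
            np.log(within_dispersion(R, KMeans(n_clusters=k, n_init=10).fit_predict(R)))
            for R in (rng.uniform(lo, hi, size=X.shape) for _ in range(n_refs))
        ]
        gaps.append(np.mean(ref_log_wk) - log_wk)
    return gaps
```

With n_refs reference data sets and k_max candidate cluster counts, the random data is clustered n_refs × k_max times on every run, which is why this bootstrapping step dominates the pipeline’s execution time.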

img9

The second case study is still ongoing. We are collaborating with nSilico to reduce the cost of running the BLAST sequence search method on the company’s Simplicity platform. Paul Walsh kindly provided the pie chart, which shows that BLAST consumes 84% of all computation on the platform.

img10

Early results point to at least a threefold cost reduction for running an Analytics Engines accelerated BLAST on Simplicity.

img11

Thanks for reading, please get in touch if you would like any more information.

img12
