Solving Business Issues in Big Pharma and Fintech using R: EARL 2015

14th October 2015

I recently spoke at EARL in London. For those of you who weren’t lucky enough to be there, here’s what you missed. For those who were there, here’s a lasting reminder of what I’m sure was a momentous time in your life.

The main point is how we can productise exploratory R code efficiently and effectively using license-free tools. The slides are shown below, with comments under some of them. In many cases we have used Python wrappers to build additional infrastructure around our R code, since Python is the stronger general-purpose language.

EARL 2015 Title

EARL 2015 Overview

The key point of this presentation is to outline our experience at Analytics Engines in productising two existing R-based algorithms, one in bioinformatics/pharma and the other in financial technology.

EARL 2015 Data Science 1

Exploratory code developed in an iterative fashion to answer a data science question is often not production ready. It is frequently too slow, not robust under error conditions, not scalable to multiple machines or cores, and it requires manual driving.

EARL 2015 Data Science 2

To reduce the pain of productising existing exploratory code we want to minimise the number of code changes required. We can achieve this by leveraging existing technologies as much as possible. Watch out for NIH (not invented here) syndrome and reuse mature components instead.

EARL 2015 Data Science 3

Automate: Remove manual steps and do as much as possible programmatically.

Parameterise:  Choose what knobs need to be turned and expose only these to the rest of the system.

Optimise:  Find the slow parts of your program and make them faster.
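As a rough illustration of the first two principles, the Python sketch below hides a stand-in analysis step behind a small, validated set of parameters so that a scheduler can drive it with no manual steps. The names `PipelineParams` and `run_analysis`, and the knobs themselves, are hypothetical and not taken from our pipelines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineParams:
    """The only knobs exposed to the rest of the system (hypothetical)."""
    threshold: float = 0.5
    max_items: int = 1000

    def __post_init__(self):
        # Fail early on bad settings instead of part-way through a run.
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0 and 1")
        if self.max_items <= 0:
            raise ValueError("max_items must be positive")


def run_analysis(values, params=PipelineParams()):
    """Stand-in for an exploratory step: keep values above the threshold."""
    kept = [v for v in values if v > params.threshold]
    return kept[: params.max_items]


if __name__ == "__main__":
    print(run_analysis([0.1, 0.7, 0.9], PipelineParams(threshold=0.6)))
```

Keeping the parameters in one immutable object means every run can be logged and reproduced from its settings alone.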

EARL 2015 Pharma 1

In the first case we looked at a DNA microarray analysis pipeline built from multiple technologies.

EARL 2015 Pharma 2

To enable large scale discovery we needed to increase the amount of automation and make the code cloud-ready by removing tools with unsuitable licenses for cloud usage.

EARL 2015 Pharma 3

Wrapping R code up in RPy2 provides relatively easy integration with Python, meaning we can add more flexibility to the pipeline.
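A minimal sketch of this kind of wrapper is shown below. It assumes R and the rpy2 package are installed, and the R function `scale_values` is a hypothetical stand-in for existing exploratory code rather than anything from our pipeline.

```python
def call_r_function(r_source, func_name, values):
    """Evaluate R source in the embedded R session and call one function.

    The rpy2 import is deferred so that the rest of a pipeline can be
    loaded on machines where R is not available.
    """
    import rpy2.robjects as ro

    ro.r(r_source)                      # evaluate the R source unchanged
    r_func = ro.globalenv[func_name]    # look up the R function by name
    result = r_func(ro.FloatVector(values))
    return list(result)                 # convert the R vector to a Python list


# Hypothetical stand-in for existing exploratory R code.
R_SOURCE = """
scale_values <- function(x) {
  (x - mean(x)) / sd(x)
}
"""

# Example call (requires R and rpy2):
# call_r_function(R_SOURCE, "scale_values", [1.0, 2.0, 3.0])
```

The R source is passed through untouched, which is the point: the wrapper adds a Python entry point without requiring changes to the original code.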

EARL 2015 Pharma 4

From our Python pipeline, it is relatively easy to add HDFS for backend storage and Celery for scalable processing, and thus to integrate with scalable compute on AWS.

EARL 2015 Pharma 5

We want to retain as much of the existing core IP as possible from R. The additional tools built around this core IP allow us to significantly enhance its capability.

EARL 2015 FinTech 1

EARL 2015 FinTech 2

In our FinTech case study in bond analytics, our requirements were mainly around automation for daily loads from a financial data provider and also optimisation, so that new data could be processed nightly and new predictions made in time for reopening of the markets.

EARL 2015 FinTech 3

EARL 2015 FinTech 4

Sometimes the location of performance bottlenecks can be surprising. We had assumed that the bottleneck was in the predictive algorithm implementation, but our initial profiling tests using Intel VTune showed otherwise. It was actually due to R garbage collection, triggered by the code creating a large number of temporary objects.

EARL 2015 FinTech 5

Rcpp is a fantastic tool that eases the integration of R and C++. This can lead to huge performance improvements in some cases. For this piece of code the speedup was ~600x.

To reproduce the results of your original R code when using R's random number generators, be careful to use RNGScope and the Rcpp RNG API.

EARL 2015 FinTech 6

The R rbenchmark package is great for testing your optimisations in a standalone environment.

EARL 2015 FinTech 7

Our final pipeline is built in Python with RPy2 wrappers for the R code. PostgreSQL is used for the ETL and as the data sink. The whole lot is wrapped in a systemd service that schedules the daily pipeline.
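The overall shape of such a pipeline can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our production code: `run_daily_pipeline` and its three stage callables are hypothetical, and the stand-in stages below avoid any dependency on a database or on R.

```python
import logging

logger = logging.getLogger("daily_pipeline")


def run_daily_pipeline(extract, predict, load):
    """Run the three stages in order and report success as an exit code.

    `extract`, `predict` and `load` are hypothetical callables: in a real
    deployment they might read from PostgreSQL, call R through a wrapper,
    and write the predictions back to PostgreSQL.
    """
    try:
        rows = extract()
        predictions = predict(rows)
        load(predictions)
    except Exception:
        # Log the traceback and signal failure so that whatever launched
        # the run can record it as failed.
        logger.exception("daily pipeline failed")
        return 1
    logger.info("daily pipeline processed %d rows", len(rows))
    return 0


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    sink = []
    # Stand-in stages so the sketch runs on its own.
    exit_code = run_daily_pipeline(
        extract=lambda: [1.0, 2.0, 3.0],
        predict=lambda rows: [2 * r for r in rows],
        load=sink.extend,
    )
    print(exit_code, sink)
```

Returning an exit code rather than raising keeps the script easy to run unattended, since a service manager can tell a failed run from a successful one without parsing logs.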

EARL 2015 Conclusions
