14th October 2015
I recently spoke at EARL in London. For those of you who weren’t lucky enough to be there, here’s what you missed. For those who were there, here’s a lasting reminder of what I’m sure was a momentous time in your life.
The main point is how to productise exploratory R code efficiently and effectively using licence-free tools. The slides are shown below, with comments under some of them. In many cases we used Python wrappers to build additional infrastructure around our R code, given Python's general-purpose power compared to R.
The key point of this presentation is to outline our experience at Analytics Engines in the productisation of two existing R-based algorithms, one in Bio-informatics/Pharma and the other in Financial technology.
It is often the case that exploratory code, developed in an iterative fashion to solve a data science question, is not production ready. It is often too slow, not robust under error conditions, not scalable across multiple machines/cores, and requires manual driving.
To reduce the pain of productising existing exploratory code we want to minimise the number of code changes required. We can achieve this by leveraging existing technologies as much as possible. Watch out for NIH (not invented here) syndrome and instead reuse mature components.
Automate: Remove manual steps and do as much as possible programmatically.
Parameterise: Choose what knobs need to be turned and expose only these to the rest of the system.
Optimise: Find the slow parts of your program and make them faster.
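The automate and parameterise steps can be sketched in Python. This is a minimal illustration, not our actual pipeline: the function names, knobs, and defaults here are hypothetical stand-ins.

```python
import argparse

def run_pipeline(input_path, n_cores=1, threshold=0.05):
    """Hypothetical pipeline entry point: everything below here runs
    programmatically, with no manual steps (the 'automate' goal)."""
    # ... load data, call the wrapped R code, write results ...
    return {"input": input_path, "cores": n_cores, "threshold": threshold}

def main(argv=None):
    # 'Parameterise': expose only the knobs the rest of the system
    # needs to turn; every other setting stays fixed inside run_pipeline().
    parser = argparse.ArgumentParser(description="Daily analysis pipeline")
    parser.add_argument("input_path")
    parser.add_argument("--n-cores", type=int, default=1)
    parser.add_argument("--threshold", type=float, default=0.05)
    args = parser.parse_args(argv)
    return run_pipeline(args.input_path, args.n_cores, args.threshold)

if __name__ == "__main__":
    main()
```

Keeping the exposed surface this small means the scheduler or calling system never needs to know about the pipeline's internals.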
In the first case we looked at a pipeline containing multiple technologies used for DNA microarray analysis.
To enable large scale discovery we needed to increase the amount of automation and make the code cloud-ready by removing tools with unsuitable licenses for cloud usage.
Wrapping R code up in RPy2 provides relatively easy integration with Python, meaning we can add more flexibility to the pipeline.
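A minimal sketch of the rpy2 wrapping pattern we mean: look up an R function from Python, convert the arguments, and call it. The R function here (`weighted.mean`) is just a stand-in, and the pure-Python fallback exists only so the sketch runs where R/rpy2 is not installed.

```python
def weighted_mean(values, weights):
    """Call R's weighted.mean() via rpy2 when available; otherwise
    fall back to an equivalent pure-Python computation so this
    sketch stays self-contained."""
    try:
        from rpy2 import robjects
        r_func = robjects.r["weighted.mean"]  # look up an R function by name
        result = r_func(robjects.FloatVector(values),
                        robjects.FloatVector(weights))
        return float(result[0])
    except ImportError:
        # Fallback: the same computation without R.
        return sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(weighted_mean([1.0, 2.0, 3.0], [1.0, 1.0, 2.0]))  # → 2.25
```

In a real pipeline the R side would be the core IP (sourced from a script), and the Python side adds the surrounding flexibility: scheduling, error handling, and data plumbing.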
We want to retain as much of the existing core IP as possible from R. The additional tools built around this core IP allow us to significantly enhance its capability.
In our FinTech case study in bond analytics, our requirements were mainly around automation, for daily loads from a financial data provider, and optimisation, so that new data could be processed nightly and new predictions made before the markets reopened.
Sometimes the location of performance bottlenecks can be surprising. Our initial profiling tests using Intel VTune uncovered an interesting bottleneck. It was assumed that the bottleneck was in the predictive algorithm implementation. We found that it was actually due to R garbage collection, triggered by the code creating a large number of temporary objects.
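The R specifics (garbage-collection pressure from short-lived vectors) don't translate literally, but the same failure mode exists in Python, so here is an illustrative sketch of the pattern rather than our actual code. The first version churns out a fresh temporary object on every iteration; the second grows one container in place.

```python
def accumulate_with_temporaries(chunks):
    # Each '+' on a tuple builds a brand-new object, so the allocator
    # and garbage collector churn through many short-lived temporaries —
    # the same shape of problem we hit in the R code.
    out = ()
    for c in chunks:
        out = out + c
    return out

def accumulate_preallocated(chunks):
    # Growing one list in place avoids the per-step temporaries.
    out = []
    for c in chunks:
        out.extend(c)
    return tuple(out)

chunks = [(i,) * 10 for i in range(2000)]
assert accumulate_with_temporaries(chunks) == accumulate_preallocated(chunks)
```

The fix in both languages is the same idea: restructure the hot loop so it reuses storage instead of allocating and discarding on every step.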
Rcpp is a fantastic tool easing the integration of R and C++. This can lead to huge performance improvements in some cases. For this piece of code the speedup was ~600x.
To generate the same results as your original R code when using R's random number generators, be careful to use RNGScope and the Rcpp RNG API.
The R rbenchmark package is great for testing your optimisations in a standalone environment.
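rbenchmark covers the R side; on the Python side of a pipeline, the standard library's timeit gives the same kind of standalone before/after comparison. The two implementations below are hypothetical stand-ins, purely to show the harness shape.

```python
import timeit

def slow_sum_of_squares(n):
    # Baseline: pure-Python loop.
    total = 0
    for i in range(n):
        total += i * i
    return total

def fast_sum_of_squares(n):
    # Optimised: closed-form formula for 0^2 + 1^2 + ... + (n-1)^2.
    m = n - 1
    return m * (m + 1) * (2 * m + 1) // 6

# First prove the optimisation is correct, then measure it.
assert slow_sum_of_squares(10_000) == fast_sum_of_squares(10_000)

# Best-of-5 repeats, analogous to rbenchmark's replications.
for fn in (slow_sum_of_squares, fast_sum_of_squares):
    best = min(timeit.repeat(lambda: fn(10_000), number=100, repeat=5))
    print(f"{fn.__name__}: {best:.4f}s for 100 calls")
```

The correctness assertion before the timing loop matters: an optimisation that changes the answer isn't an optimisation.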
Our final pipeline is built in Python, with RPy2 wrappers around the R code. PostgreSQL handles the ETL staging and acts as the data sink. The whole lot is wrapped in a systemd service that schedules the daily pipeline.
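A condensed sketch of that daily pipeline shape. To keep it self-contained, sqlite3 stands in for PostgreSQL, and the extract/model steps are trivial placeholders, not our production code; in the real system, `transform` would call the rpy2-wrapped R model and systemd would invoke `run_daily` nightly.

```python
import sqlite3
from datetime import date

def extract(as_of):
    # Placeholder for the daily load from the financial data provider.
    return [(str(as_of), "BOND-A", 101.2), (str(as_of), "BOND-B", 99.7)]

def transform(rows):
    # Placeholder for the rpy2-wrapped R model: here, a trivial prediction.
    return [(d, name, price, round(price * 1.001, 3))
            for d, name, price in rows]

def load(conn, rows):
    # PostgreSQL in production; sqlite3 here so the sketch runs anywhere.
    conn.execute("""CREATE TABLE IF NOT EXISTS predictions
                    (as_of TEXT, bond TEXT, price REAL, predicted REAL)""")
    conn.executemany("INSERT INTO predictions VALUES (?, ?, ?, ?)", rows)
    conn.commit()

def run_daily(conn, as_of=None):
    # The systemd service schedules this once per night.
    load(conn, transform(extract(as_of or date.today())))

conn = sqlite3.connect(":memory:")
run_daily(conn, as_of="2015-10-14")
print(conn.execute("SELECT COUNT(*) FROM predictions").fetchone()[0])  # → 2
```

Keeping extract, transform, and load as separate functions means each stage can be tested, retried, or swapped out without touching the others.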