2nd August 2016
The title of this blog post seems highly logical doesn’t it? You should use the right data store for your data. Basically we feel that you shouldn’t force data into a datastore that is not a natural fit to that data.
There are many (approx. 300) datastores in the market.
The reason there are 300+ is that there are 300+ different requirements for a datastore that are not possible to address inside a single system. This is not a new concept, but an application and logical extension of CAP Theorem
Let’s take a real world example… Say you require a means of storing data that records the transactions for your bank account. In this case, you need a transactional database that can provide you with an up to date statement of your current balance so you know if you have enough cash to buy something. In this context, it is essential that the data is always up to date. If the data is out of date then then it cannot work out your balance accurately. So if it is only updated weekly or daily or even every minute it is not going to be able to show your correct balance. In this context the system only works when the balance reflects actual balance.
When we look at the database that is required for this, it is very clear what the fields are for the records (transaction id, timestamp, value, notes). Where the transaction id is a unique primary key, the timestamp is a record based on standard time reference of the order of the transaction and the value indicates an increment or decrement of the relevant value, notes are a free-form field to make the transaction id more usable.
Lets take a very different real world example… Social Media. How important is it that your twitter account is always 100% up to date and error free? What’s the worst case scenario… Say you are you are looking at your social feed and you see an update from someone in your network slightly later than someone else. Is this likely to be hugely problematic? It seems unlikely… In the real world, this has little relevance and can happily be traded off to gain the benefit of scale (typically implementing eventual consistence). But on the other hand in a social media system there are not just thousands of people and systems updating records as is the case with banking, but millions. Social media systems are designed and built to enable millions of people to simultaneously update the data.
These are two simple examples of different data stores for different purposes. However, they are not the only two: Karim Vaes provides a good overview of a number of different types of Databases and datastores and in particular created this simple graphic to highlight some differences in the structures of these systems .
When we start to think about analytics, things change. It’s not just about storing data and retrieving it but analysing this data and gaining insight. In this context we do not want to be limited to the data inside any single environment, but rather it is essential to have access to all the data necessary to create the analysis we want to create. In early analytics environments the challenge was to carry out an analysis across a very specific dataset. In this case, typically the problem addressed was constrained inside one data environment or was sufficiently small/manageable that it was possible to import different datasets into a single analytics environment.
These days things are different. Analytics has moved on and it is no longer feasible to transform data and move it into an environment to facilitate analysis. The data is either too large or we want to create an analysis that includes up to date data (not necessarily real-time in the strictest definition, but time relevant data).
Another factor to consider is that we do not want analysis to be carried out as a special task. To have a real impact, analytics is integrated as a ‘normal’ part of business operations. So every person in an organisation should be able to draw on analytics and analytics results to clarify or enable a decision in their business day. Building or accessing analytics results should not be a special task that needs preparation to complete, but should be considered normal practise. While this may seem alien to many today, it is akin to looking at the current balance on an electronic ledger. This has become normal practise to understand if you are in budget and is common and understood by everyone. It is no longer necessary to ask an accountant to provide a current balance of budget, reviewing the balance at the bottom of a spreadsheet is something we understand and accept as standard practice.
So clearly an alternative is required where we can carry out analytics across large quantities of data and real-time or near real-time data as a standard business process, without having to create a project every time we need an answer (akin to asking the accountant for the account balance).
This is exactly the challenge that Analytics Engines XDP fulfils. Using XDP data is not moved around and transformed from one system to another, you do not need to export all your data out into Hadoop or integrate you social feed into your database. Instead XDP works with your existing environments. XDP transforms queries from one environment into another enabling the query to be moved around the system not the data. This is in effect moving the query and the compute to the data rather than moving the data to the compute, which has been the challenge of computer science since we started addressing big data and multi-core processing. This is where we started to understand the cost of moving data vs the cost of moving compute. Before using XDP, this challenge has only ever been answered at low-level computing, never for the business user. The low-level compute mechanism for addressing the moving of compute to data has been to execute the operation inside a virtual machine and enable virtual machines to run in many different locations. But this is simply not accessible to the casual user and cannot be adopted as a casual business process. With XDP the user is never exposed to Java or virtual machines, but instead is provided access to SQL and can use all of the standards SQL tools that are available for the business user. At a very minimal capability SQL can be accessed by Excel and any tools more complex than this…. Effectively this implies that almost any tool on the market can utilise XDP and the benefits it brings without forcing the user to adopt or learn a new tool-set or environment. (e.g. Tableau, Qlik, Excel, etc).
In effect XDP enables users to open up the capabilities of existing tools such as Qlik, Tableau or even Excel to integrate and analyse across different data types.
XDP leverages the unique data integration features within PostgresQL implemented as data connectors to target data sources. Through utilisation of PostgresQL Analytics Engines have been able to adopt the full features from a database with a very strong pedigree and extend its capabilities to provide seamless integration of additional data stores. Analytics Engines has written a number of connectors to integrate the main data sources expected in typical environments (e.g. Oracle, MS-SQL, Hadoop, MySQL, etc). New connectors are continually being developed as customers request additional integration points. The whole concept of data connectors and the value of these will be addressed in more detail in our next blog post. Through the adoption of SQL users can re-use existing SQL based tools, and crucially re-use existing skills within their organisation. There is no necessity to learn new languages or tools to enable users to analyse data across different data sources as most organisations already have team members skilled in SQL or using tools that access SQL systems.
Through the adoption of XDP customers can up-skill their teams by providing them the ability to access data in virtually any data environment from within one platform. In effect this enables each member of your team to become a data Polyglot.
In a practical example you may have different data systems within your organisation for different purposes such as those shown below. Each system is configured and installed on a data store that is specific for the task at hand. Additionally, any single system could not take on the roles of the other systems in the design so this is in effect the minimal configuration of systems. Traditionally this type of system is difficult to analyse across as it would require the creation of another system dedicated for analytics and the creation of a number of specific export functions against each of the original systems. This is a lot of upfront cost and effort, particularly if the return on investment for the analytics is still unproven. Of course this is the big challenge of analytics. how do we know the RoI before we do the analytics? It’s a bit of a catch 22 situation where the results of the analytics show the value of the analytics but the cost of doing the analytics is prohibitive to finding out the value.
On top of the original effort to pull data from the data systems IT processes will need to be created to enable the data to be regularly pulled from the data systems or accept that the data quickly becomes stale.
Contrary to the traditional design of analytics systems where data needs to be exported before it can be analysed, XDP enables the users to analyse their data in-situ. Data systems already have intelligence and query functions associated with them, but traditional analytics systems ignore those functions and utilise only the simplest possible option (export). XDP uses the underlying query functions of each data store to allow it to query, filter, sort, process and analyse data within and across individual data stores. In effect XDP takes the query the user wishes to run and rewrites the part of that query that is relevant to a particular data store into the language used by that tool. For example: if you have ERP data in Oracle and social media data in MongoDB, you can write a single query to ‘JOIN’ these two different datasets based on userid and return the tweets of users who bought particular products. In behind the scenes XDP will take the userid handle and lookup the ERP sales records for particular products and return the users who bought that product using Oracle SQL. These user records are then used as the search index to find the social media tweets for those particular users (using a MongoDB query). Of course this can be further extended by including the sales dates and then considering how opinions and emotions change before and after sales. All this achieved by doing a simple JOIN function between two datasets presented inside XDP.
In future blog posts we will expand on data connectors and the concept of data virtualisation as well as giving some examples of XDP in use. Subscribe to the blog to keep up to date with our updates.