17th April 2015
Data Lakes are “marketed as enterprise wide data management platforms for analyzing disparate sources of data in its native format” [Gartner]. A data lake typically uses an ELT (Extract, Load, Transform) process to ingest new data. Data stored within the data lake can be structured or unstructured, and storing it does not require prior knowledge of the analyses you will later want to perform.
In contrast, Data Warehouses are defined as “central repositories of integrated data from one or more disparate sources. They store current and historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons” [Wikipedia]. Data Warehouses require the data to be structured using ETL (Extract, Transform, Load) when it is imported. This depends upon having prior knowledge of the queries to be performed against the data.
When using a Data Warehouse the IT team need to put in place suitable ETL processes to extract data from the original data sources and bring it into the Data Warehouse (or Business Warehouse). At the point of loading, both the queries and analyses to be run against the Data Warehouse and the schema to be used must already be known. In particular, if a query needs to be modified in a way the schema does not support, it is necessary to re-ingest the data into the Data Warehouse.
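To make the schema-on-write idea concrete, here is a minimal sketch in Python using the standard library `sqlite3` module as a stand-in for a warehouse. The order records, the `orders` table and the function names are all hypothetical, chosen only for illustration. The transform runs before the load, so the table only holds what the quarterly revenue report needs.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from a source system.
RAW_ORDERS = [
    {"id": 1, "customer": "acme", "amount": "120.50", "placed": "2015-01-14", "channel": "web"},
    {"id": 2, "customer": "acme", "amount": "80.00", "placed": "2015-02-03", "channel": "phone"},
    {"id": 3, "customer": "globex", "amount": "45.25", "placed": "2015-04-20", "channel": "web"},
]


def transform(record):
    """The T of ETL: shape a raw record to fit the warehouse schema."""
    year, month, _day = record["placed"].split("-")
    quarter = (int(month) - 1) // 3 + 1
    # Note that "channel" is dropped here: the planned report does not need it.
    return (record["id"], record["customer"], float(record["amount"]), int(year), quarter)


def load_warehouse(records):
    """Create the fixed schema, then load the already transformed rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, customer TEXT, amount REAL, "
        "year INTEGER, quarter INTEGER)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
        [transform(r) for r in records],
    )
    conn.commit()
    return conn


def revenue_by_quarter(conn):
    """The query the schema was designed for."""
    cursor = conn.execute(
        "SELECT year, quarter, SUM(amount) FROM orders "
        "GROUP BY year, quarter ORDER BY year, quarter"
    )
    return cursor.fetchall()


warehouse = load_warehouse(RAW_ORDERS)
quarterly = revenue_by_quarter(warehouse)
print(quarterly)
```

In this sketch a later request for revenue by sales channel cannot be answered from the `orders` table, because `channel` was discarded during the transform. The schema and the transform would have to change and the source data be loaded again.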
Alternatively, when using a Data Lake an ELT process is used: little or no transformation is applied during the load stage, and the transform is applied on demand when the data is accessed. In this model the data is pulled into the Data Lake in its raw format, leaving the definition of the schema until the data is read. This requires sufficient processing capability to calculate and deliver the result to the consumer in a reasonable timescale.
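As a rough illustration of schema-on-read, the following Python sketch writes raw records untouched to a file in a temporary directory standing in for the lake, then applies a schema only when a question is asked. The records, the file name `orders.jsonl` and the helper functions are hypothetical, and a real Data Lake would sit on a distributed store rather than a local directory.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical raw records, as they might arrive from a source system.
RAW_ORDERS = [
    {"id": 1, "customer": "acme", "amount": "120.50", "placed": "2015-01-14", "channel": "web"},
    {"id": 2, "customer": "acme", "amount": "80.00", "placed": "2015-02-03", "channel": "phone"},
    {"id": 3, "customer": "globex", "amount": "45.25", "placed": "2015-04-20", "channel": "web"},
]


def load_raw(lake_dir, records):
    """The L of ELT: store every record as received, one JSON document per line."""
    path = Path(lake_dir) / "orders.jsonl"
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return path


def read_with_schema(path, schema):
    """The T of ELT, applied on read: schema maps field name -> converter."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            raw = json.loads(line)
            yield {field: convert(raw[field]) for field, convert in schema.items()}


def total_by(path, key):
    """Sum the amount per value of `key`, choosing the schema at query time."""
    schema = {key: str, "amount": float}
    totals = {}
    for row in read_with_schema(path, schema):
        totals[row[key]] = totals.get(row[key], 0.0) + row["amount"]
    return totals


with tempfile.TemporaryDirectory() as lake_dir:
    raw_path = load_raw(lake_dir, RAW_ORDERS)
    # Two different questions against the same raw file, with no re-ingest.
    by_customer = total_by(raw_path, "customer")
    by_channel = total_by(raw_path, "channel")

print(by_customer)
print(by_channel)
```

The cost shows up at query time: every call to `total_by` re-reads and re-parses the raw file, which is why the processing capability available to the lake matters.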
Data Academy have put together a great document analysing ETL vs ELT. Summarised as:
No, not quite. A Data Lake is composed of a lot of directories containing your data, arranged as you feel is most appropriate to manage it. This can include both structured and unstructured (free text) data. The big difference with a Data Lake is that you do not need to perfect the schema up front, as you would with a BDUF (Big Design Up Front) approach.
Andrew C. Oliver captures this very well in his blog post “How to create a data lake for fun and profit”. He summarises this down to four simple steps:
Closely related to the ETL vs ELT debate, Tamara Dull and Anne Buff provide a great series of articles comparing the Data Lake and the Data Warehouse, arguing for and against:
Chris Twogood identifies 5 questions to ask before selecting a solution: