Data Lake vs Data Warehouse

17th April 2015

Data Lakes are “marketed as enterprise wide data management platforms for analyzing disparate sources of data in its native format” [Gartner]. A data lake typically uses an ELT (Extraction, Load and Transform) process to ingest new data. Data stored within the data lake can be structured and unstructured and does not require prior knowledge of the analyses you think you want to perform.

In contrast Data Warehouses are defined as “central repositories of integrated data from one or more disparate sources. They store current and historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons” [Wikipedia]. Data Warehouses require structuring to be performed  using ETL (Extraction Transform and Load) when importing the data. It depends upon having prior knowledge of the queries to be performed against the data.

Data_warehouse_overview

Data Warehouse

Data Lake Overview

Data Lake

ETL vs ELT

When using a Data Warehouse the IT team need to put in place suitable ETL processes to extract data from original data sources and bring it into the Data Warehouse (or Business Warehouse). At the point of loading data into the Data Warehouse the query and analysis for the Data Warehouse and the schema used must be known. In particular if the query needs to be modified then it is necessary to re-injest the data into the Data Warehouse.

Alternatively, when using the Data Lake an ELT process is used where little or no transform is incurred during the load stage and transform is applied on demand when accessing the data. In this format the data is pulled in its raw format into the Data Lake leaving the definition of the schema until the data is pulled and accessed. In this model sufficient processing capability is required to calculate and deliver the result to the user in reasonable timescales for the consumer.

Data Academy have put together a great document analysing ETL vs ELT. Summarised as:

  • ETL:
    • ETL and BI projects by nature are investments into evolving processes and therefore have no distinct end point and is an ongoing, improving and re-targeting project process.
    • ETL works from the output backwards and hence on relevant data is extracted and processed.
    • Future ETL requirements needing data cannot be foreseen and defined in the original design.
  • ELT:
    • Isolating Loading and Transforming enables projects to be broken down into specific chunks that are more isolated and become more manageable.
    • ELT is an emergent approach to data warehouse design and development requiring a change in mentality and design approach compared to traditional ETL.
    • Future requirements can easily be incorporated into the warehouse structure as all data is pulled into the Data Lake in its raw format.

So…Is a Data Lake really unstructured?

No, not quite. A Data Lake is composed of a lot of directories containing your data arranged as you feel is most appropriate to manage them. This can include both structured and unstructured (free text) data. The big difference with a Data Lake is that you do not need to perfect the schema as a BDUF (Big Design Up Front).

Andrew C. Oliver captures this very well in his BLOG
How to create a data lake for fun and profit. He summarises this down to four simple steps:

  1. Identify a few use cases
  2. Build the lake
  3. Get data from lots of different sources into the lake
  4. Provide a variety of fishing poles and see who lands the biggest and best trout (or generates the most interesting data-backed factoid)

Pros/Cons for Data Lake

Closely related to the ETL vs ELT debate Tamara Dull and Anne Buff provide a great series of articles comparing Data Lake and Data Warehouse arguing for and against:

  • Pro Data Lake
    • Any and all data can be captured and stored in a data lake
    • A data lake allows for more questions and better answers
  • Con Data Lake
    • Data storage alone has no impact on the effectiveness of business decisions
    • Inexpensive storage is not infinite or limitless

What do I need to know before I start my Data Lake?

Chris Twogood identifies 5 questions to ask before selecting a solution:

  1. How will this big data solution handle the rush of data today and tomorrow?
  2. What is the total cost of the big data solution?
  3. Is the estimated time to value acceptable?
  4. What direct and indirect benefits should you expect from a big data solution?
  5. Are next generation short cuts or implementation aids available?

Share This

Tweet this!   Share on LinkedIn   Share on Facebook

One thought on “Data Lake vs Data Warehouse”

Leave a Reply

Your email address will not be published. Required fields are marked *