What is a data lake?

Let's start at the beginning … "A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning."

(and AWS)

You can, of course, google this and get this (or a similar) definition.

So a data lake is, in its simplest form, a data store of structured (curated) and unstructured (non-curated) data - both slow and fast moving.

Sounds great, but what isn't it then?

Well, it's NOT...

A data warehouse or a data hub/mart.

These are very different things. (Another blog/another day :-) )

A general data store used for any system that "wants" this data

Data lakes should not be used in an 'ooh look, there's the data I need, I'll just grab it from there' way. Data use must follow a DATA GOVERNANCE model. By the definition above, a data lake is for tasks such as reporting, visualization, analytics, and machine learning. That business use would follow our governance model, and the curated data in the lake would be appropriate for that use - and not for other uses. Also, data lakes typically contain curated and non-curated (raw) data, different schemas, and relational and non-relational data, so they are less likely to be indexed and optimized for application access.
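To make the governance point a bit more concrete, here's a minimal sketch in Python of what a purpose check in front of the lake might look like. Everything in it is hypothetical and for illustration only: the AccessRequest type, the is_permitted function, and the requester and dataset names are my own inventions, not part of any real product. The only thing taken from above is the list of approved purposes, which comes straight from the definition.

```python
from dataclasses import dataclass

# Approved purposes, taken from the data lake definition quoted in this post.
APPROVED_PURPOSES = frozenset(
    {"reporting", "visualization", "analytics", "machine learning"}
)


@dataclass(frozen=True)
class AccessRequest:
    """Hypothetical request to read a dataset from the lake."""
    requester: str
    dataset: str
    purpose: str


def is_permitted(request: AccessRequest) -> bool:
    """Allow access only when the stated purpose is an approved one.

    This is an assumed, deliberately simple rule; a real governance
    model would also consider who is asking and which data they want.
    """
    return request.purpose.strip().lower() in APPROVED_PURPOSES


# Hypothetical requesters and dataset names.
bi_request = AccessRequest("bi-team", "sales_curated", "reporting")
app_request = AccessRequest("checkout-service", "sales_curated", "operational lookup")

print(is_permitted(bi_request))   # True
print(is_permitted(app_request))  # False
```

The point is not the code itself but where the decision sits: the question 'what is this data for?' gets asked before anyone grabs anything, rather than after.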

A golden source

A golden source is the SINGLE source of truth. A data lake can, should and will contain golden data - we call this a golden copy - but it's not the active book of record, and a data lake shouldn't be used as such.

An operational data source

We shouldn't look to run applications or business-facing operational activity on a data lake. There is a temptation to think that once we have enterprise data coalesced into a single place, we're 'done' and can hang every downstream application/system from it.

I propose care here. The temptation is there, and there is a view that data lakes can become a core part of the data infrastructure, replacing existing data marts or operational data stores and enabling the provision of data as a service. However, this is a LONG way up the maturity curve, and starting off in this vein may cripple getting things up and running in the first place. I prefer to think that we would want to move through an evolution - as described by the diagram below from McKinsey - and keep remembering what it's for: tasks such as reporting, visualization, analytics, and machine learning.

Based on that -

Should you have a data lake?

I think so, yes, but let's not get ahead of ourselves - we need to be clear on the problem we are trying to solve. There are NO silver bullets, and a data lake will form part of any rich data architecture, which I believe will include data silos, data warehouse(s) and a data lake. Tho' that IS a blog for another day!

Further reading:

https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/

https://www.forbes.com/sites/bernardmarr/2018/08/27/what-is-a-data-lake-a-super-simple-explanation-for-anyone/#425f021576e0

https://azure.microsoft.com/en-gb/solutions/data-lake/

https://www.datameer.com/blog/whats-data-lakes-five-questions-answered/

Great poster that explains the benefits from an analytics/BI/Data Science perspective: http://bit.ly/2MMKIez