Data lakes and big data analytics: the what, why and how of data lakes

What are data lakes and how are they used for big data analytics? A definition and description of data lakes, how they work, and their benefits, drivers and disadvantages.

To make a big data project succeed you need at least two things: knowing what (blended) actionable data you need for your desired outcomes and getting the right data to analyze and leverage in order to achieve those outcomes.

That much seems obvious. However, as you know, we have ever more data coming from ever more sources and in ever more forms and shapes. Big data indeed. As you also know, neither the volume of data nor its variety is about to decline any time soon. Quite the contrary.

Data lakes as a way to end data silos in a fast growing and increasingly unstructured big data universe

Just look at the Internet of Things (IoT), where mainly the Industrial Internet of Things is poised to grow fast in the coming years.

And with that growth indeed comes more data or better: data is what we are after with the Internet of Things, in order to gain big insights and drive relevant actions and operations to achieve whatever outcome: big data analytics with a purpose; smart data for smart applications – and inevitably artificial intelligence to make sense of all that data.

A data lake is a place to put all the data enterprises (may) want to gather, store, analyze and turn into insights and action, including structured, semi-structured and unstructured data

Traditionally, data has resided in silos across the organization and the ecosystem in which it operates (external data). That’s a challenge: you can’t combine the right data to succeed in a big data project if that data is scattered everywhere, in and out of the cloud.

This is, among others, where the idea – and reality – of (big) data lakes comes from. As a concept, the data lake was promoted by James Dixon, who was CTO at Pentaho and saw it as a better repository alternative for the big data reality than a data mart or data warehouse.

Here is how Dixon defined or explained the data lake on his blog in 2011: “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples”.

Data lakes are storage repositories with an analytics and action purpose

You can indeed see a data lake as a bit like a lake, without the swans and the water. OK, that doesn’t sound like a lake. But you get the idea: a big data lake is in essence a storage repository containing loads of data in its raw, native format.

Traditional data management approaches aren’t fit (or require a lot of money) to handle big data and big data analytics. With big data analytics essentially we want to find correlations between different data sets which need to be combined in order to achieve our business outcome. And if these data sets sit in entirely different systems, that’s virtually impossible.

An example of such a goal could be to combine data regarding a customer from one source with data from other sources and even seemingly unrelated data (for instance, traffic data, weather data, data on customers that seem non-related to our business) to act upon them to enhance the customer experience, come up with new services or simply sell more.

Bottom-up data analytics: ingestion to fill up the data lake

What does all this have to do with a data lake? Well, big data lakes are one of two information management approaches for analytics.

The first one is top-down (data warehousing). The second one is bottom-up: the data lake, the topic we’re covering here. To make this more tangible, let’s go back to the image of a real lake. A lake doesn’t just fill up by itself. Usually there are rivers or smaller streams that bring water to it.

Data lakes are designed for big data analytics and to solve the data silo challenge in big data

The same happens in a data lake: data flows in, regardless of source or structure. This is known as the ingestion of data. We collect all the data we need to reach our goal through the mentioned data analytics.

These ‘streams’ of data come in several formats: structured data (simply put, data from a traditional relational database or even a spreadsheet: rows and columns), unstructured data (social, video, email, text,…), data from all sorts of logs (e.g. weblogs, clickstream analysis,…), XML, machine-to-machine, IoT and sensor data, you name it (logs and XML are also called semi-structured data).

They also involve various types of data from a contextual perspective: customer data, data from line-of-business applications, sales data, etc. (entered in the data lake via APIs). And, obviously we increasingly have external data (sources) which we want to leverage in order to achieve our goals.
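To make ingestion a bit more concrete, here is a minimal sketch in Python. It is not how any particular data lake product works. It assumes a plain folder as the lake, and the `ingest` function, the folder layout and the `catalog.jsonl` file are all hypothetical names chosen for illustration. The point it shows: each payload is stored unchanged, in its native format, while a small metadata record keeps it findable.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest(lake_root: Path, source: str, name: str, payload: bytes, kind: str) -> dict:
    """Store one payload unchanged in the lake and record basic metadata.

    `kind` is a free-form label such as "structured" or "unstructured".
    """
    target_dir = lake_root / "raw" / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(payload)  # raw, native format: no parsing, no reshaping

    entry = {
        "source": source,
        "path": target.relative_to(lake_root).as_posix(),
        "kind": kind,
        "bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    # Append to a simple JSON-lines catalog (an assumption of this sketch).
    with (lake_root / "catalog.jsonl").open("a", encoding="utf-8") as catalog_file:
        catalog_file.write(json.dumps(entry) + "\n")
    return entry


def load_catalog(lake_root: Path) -> list[dict]:
    """Read back every metadata record written by `ingest`."""
    lines = (lake_root / "catalog.jsonl").read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines]


# Three hypothetical 'streams': a CSV export, a JSON log line and free text.
lake = Path(tempfile.mkdtemp(prefix="lake_"))
ingest(lake, "crm", "customers.csv", b"id,name\n1,Ada\n2,Grace\n", "structured")
ingest(lake, "web", "clicks.log", b'{"user": 1, "page": "/home"}\n', "semi-structured")
ingest(lake, "support", "ticket_17.txt", b"The parcel arrived late.", "unstructured")
catalog = load_catalog(lake)
print(len(catalog), "objects ingested")
```

Nothing is parsed or cleaned at this stage. Interpretation is left to whoever analyzes the data later, which is what distinguishes this bottom-up approach from loading a warehouse.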

The usage of data lakes: storage, analytics, visualization and action

All this data, as far as it makes or could make sense, gets stored in the data lake while it also keeps coming in, via Application Programming Interfaces (APIs) feeding data from all sorts of applications and systems, or via batch processes.

The storage dimension is the second big piece (ingestion being the first one). And in the big data lake approach this de facto means that there are no silos. This, in consequence, means that we are ready to start the interesting work: big data analytics.

To go back to our example of combining data sets which sometimes seem unrelated: we can, for instance, detect patterns (using artificial intelligence) between purchasing behavior and weather patterns, between customer data from one source and customer data from another, between traffic data and pollution data; the list goes on. We try to keep it simple. What can you do with these patterns? A lot, as you can imagine and as ample real-life big data usage examples show. That’s where your business or other objective comes in.
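As a toy illustration of finding a pattern between two seemingly unrelated data sets, the sketch below joins made-up daily sales figures with made-up temperatures on their shared dates and computes a plain Pearson correlation. All numbers are invented, and a simple correlation coefficient stands in for the artificial intelligence mentioned above. Real pattern detection would use far more data and more careful methods.

```python
from math import sqrt

# Hypothetical daily sales, as they might come from a line-of-business system.
sales = {
    "2024-06-01": 120,
    "2024-06-02": 135,
    "2024-06-03": 90,
    "2024-06-04": 160,
    "2024-06-05": 150,
}

# Hypothetical daily maximum temperature in degrees Celsius from an external source.
weather = {
    "2024-06-01": 21.0,
    "2024-06-02": 23.0,
    "2024-06-03": 17.0,
    "2024-06-04": 27.0,
    "2024-06-05": 25.0,
    "2024-06-06": 19.0,  # no sales record for this day, so it drops out of the join
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Combine the two sources on the dates they have in common.
shared_days = sorted(sales.keys() & weather.keys())
r = pearson([sales[d] for d in shared_days], [weather[d] for d in shared_days])
print(f"{len(shared_days)} shared days, correlation {r:.2f}")
```

A value close to 1 suggests that sales rise with temperature in this invented sample. Whether such a pattern is meaningful, and what to do with it, depends on the objective.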

Obviously analyzing is not enough. You also need to visualize, understand and act upon what you have analyzed. Or as the infographic from EMC on how data lakes work below puts it: the outflow of the water is the analyzed data, which then leads to action which leads to business insights. It’s indeed our good old

Understanding data lakes – what is a data lake and how do data lakes work – infographic by EMC

Why data lakes? The benefits

As said there are traditionally two information management approaches for analytics. Why are data lakes (the bottom-up approach) popular for data analytics?

There are different reasons. First, it’s important to understand that our image of a data lake as a lake isn’t entirely correct. It’s not just some bottom-up, big, chaotic data swamp (although it can become one), and there are quite a few technologies, protocols and so forth involved. To use the image of the streams going into the lake: there are filters in place before the water actually goes into the lake.

The historical legacy data architecture challenge

Some reasons why data lakes are more popular are historical.

Traditional legacy data systems are not that open, to say the least, if you want to start integrating, adding and blending data together to analyze and act. Analytics with traditional data architectures wasn’t that straightforward or cheap either (with the need for additional tools, depending on the software). Moreover, these architectures weren’t built with all the new and emerging (external) data sources in mind which we typically see in big data.

Faster big data analytics as a driver of data lake adoption

Another important reason to use data lakes is the fact that big data analytics can be done faster.

In fact, data lakes are designed for big data analytics and, more important than ever, for real-time actions based on real-time analytics. Data lakes are fit to leverage big quantities of data in a consistent way, with algorithms to drive (real-time) analytics with fast data.

Mixing and converging data: structured and unstructured in one data lake

A benefit we more or less already mentioned is the possibility to acquire, blend, integrate and converge all types of data, regardless of sources and format.

Hadoop, one of the data lake architectures, can also deal with structured data on top of the main chunk of data: the previously mentioned unstructured data coming from social data, logs and so forth. On a side note: unstructured data is the fastest growing form of all data (even if structured data keeps growing too) and is predicted to reach about 90 percent of all data.

Moving data analytics to the source – the data lake and the edge

Then, there is the fact that moving large data sets back and forth isn’t exactly the smartest thing to do.

With big data lakes the applications are close to where the data resides. An interesting development in this sense is that you see the applications (or big data analytics) moving to the edge rather than to a storage repository to move even faster and take away the burden from networks, among others. This is the essence of fog computing, a more recent application of edge computing in the scope of data analytics in the connected factory context of Industry 4.0 and the Industrial Internet.

Data lakes and flexibility: grow and scale as you go

Next, data lakes are highly scalable and flexible. That doesn’t need too much elaboration. The system and processes can easily be scaled to deal with ever more data.

What is a data lake – according to PwC – source

Saving enterprise data warehouse resources

A final benefit we mention is that, as the illustration from PwC above shows, a data lake can serve as a staging area for the (enterprise) data warehouse (EDW).

It is then used to pass on only relevant data to the warehouse, which saves EDW resources.
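The staging idea can be sketched in a few lines of Python. The example below is an assumption-laden illustration: an in-memory SQLite database stands in for the EDW, and the event layout, the `relevant_orders` helper and the `orders` table are hypothetical. Only well-formed order events are passed on, while everything else simply stays in the lake.

```python
import json
import sqlite3

# Hypothetical raw events as they might sit in the lake: mixed types, kept as-is.
raw_events = [
    '{"type": "order", "order_id": 1, "customer_id": 7, "amount": 49.90}',
    '{"type": "page_view", "customer_id": 7, "page": "/home"}',
    '{"type": "order", "order_id": 2, "customer_id": 9, "amount": 15.00}',
    "not valid json",
]


def relevant_orders(lines):
    """Yield only well-formed order events as (order_id, customer_id, amount)."""
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # unreadable lines are skipped here but remain in the lake
        if isinstance(event, dict) and event.get("type") == "order":
            yield event["order_id"], event["customer_id"], event["amount"]


# An in-memory SQLite database stands in for the enterprise data warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", relevant_orders(raw_events))
warehouse.commit()

loaded = warehouse.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
print(f"{loaded[0]} of {len(raw_events)} raw events loaded into the warehouse")
```

The warehouse only ever sees the two order rows, so its storage and processing capacity is not spent on page views or malformed input.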

Data lake challenges, risks and evolutions

There are more benefits of big data lakes, yet as per usual we don’t want to get too technical. And, also as per usual, there are benefits, risks and challenges to address.

One of them is the mentioned risk that data lakes become data swamps if they are not designed strategically, with the necessary goals and cleaning in mind. This is also the reason why organizations move from the very traditional data lake approach to a goal-oriented and business-driven one.

Obviously, data lakes should be approached from a business-driven and strategic perspective. However, historically they have often been seen from the rising data volume perspective and the notion that in the end all data have potential value.

Potential means in the future and, while that value is indeed, well, potential, quite a few companies acted like data hoarders, without thinking too much about the data and metadata that really mattered (metadata management is an important element in a data lake context). Data hoarding is, by the way, not uncommon in information management: it’s a key reason why there is still so much paper around and going paperless remains a pipe dream.

Moreover, there is the question of whether your organization and goals need a data lake and, if so, whether you can derive value from it.

A 2015 survey by Gartner showed that for several companies Hadoop (a leader for data lake architecture) was overkill and that skills gaps (to derive value from Hadoop) were the major inhibitor.

The size of big data lakes

Since big data volumes and usage keep growing, with big data initiatives that are increasing in breadth, depth and inclusivity as Qubole puts it in a blog post, data lake sizes obviously keep growing too.

The blog post, which announces the 2018 Big Data Trends and Challenges report by Dimensional Research (sponsored by Qubole), points out that the percentage of organizations with average data lake sizes over 100 terabytes has grown from 36% in 2017 to 44% in 2018 (8 percentage points, which is a relative increase of about 22% in one year). This trend will only continue and is just one of many drivers of the shift of big data processing to the cloud.
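For readers who want to check the growth figure, the short calculation below uses only the 36% and 44% shares quoted above. It shows that the 22% is a relative increase, not a difference in percentage points. The variable names are just for illustration.

```python
share_2017 = 36  # % of organizations with data lakes over 100 terabytes in 2017
share_2018 = 44  # same share in 2018

# Absolute change, in percentage points.
point_change = share_2018 - share_2017

# Relative change, as a percentage of the 2017 share.
relative_change = point_change / share_2017 * 100

print(f"{point_change} percentage points, {relative_change:.0f}% relative increase")
# prints "8 percentage points, 22% relative increase"
```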

Average data lake sizes

Qubole in a press release on the September 2018 report: “The shift towards cloud is necessitated in part due to the ever-growing volume and diversity of data that companies are dealing with, as 44 percent of organizations now report working with massive data lakes over 100 terabytes in size”.

The report reminds us that forecasts call for future datasets that far exceed the sizes of today’s big data repositories.

The eternal challenge however remains: how to get value out of all that data. Decisions and actions indeed – that is and should be a key driver in how the market evolves.



Top image: Shutterstock – Copyright: GarryKillian – All other images are the property of their respective mentioned owners.