In the age of Big Data and Cloud Computing, information is everywhere and it’s a lot messier than it used to be. That’s why data lakes are so vital.
As businesses around the world abandon their reservations and embrace cloud technology, we are well and truly entering the world of big data. However, all that data has to go somewhere, which is why, in the age of cloud computing, data lakes are becoming increasingly important.
The Data Explosion
Data is being generated faster than ever before. An estimated 90% of all data ever created came into being in the last few years, but that’s nothing compared to what’s coming: the total is expected to increase tenfold by 2025 thanks to the triumph of digital technology. It’s coming from everywhere – from Internet of Things (IoT) devices, phone conversations, social media and much more.
What’s more, data is becoming more complex, arriving in both structured and unstructured formats. Structured data – simple information such as sales records – comes in uniform shapes and is easy to store. Unstructured data – audio and video files, or social media interactions – is fluid and much harder to capture.
All that data has to go somewhere, which brings us to the concept of a data lake: a repository where data of every kind can be stored without first being sorted or analyzed. It simply sits there, waiting to be used.
Why it’s Important
A data lake can be incredibly valuable to your business. Within its depths lie all sorts of insights into your company’s financial position, performance, and outlook. If you can unlock the information it contains, you open yourself up to a world of opportunities. A survey by Aberdeen found that companies which deployed data lakes outperformed those which did not by 9%.
Data lakes differ significantly from data warehouses, where the structure of the information is defined before it is stored. In a data lake, data can be kept whether it relates directly to the business or comes from unrelated areas such as social media, and it can all be stored without defining in advance which questions will be asked of it – an approach often called schema on read. It is, essentially, a great big bottomless pit into which you can put as much data as you like.
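The contrast can be made concrete with a small sketch. In schema on read, records land in the lake exactly as they arrive, and a schema is only imposed when someone queries the data. The record formats and field names below are hypothetical, purely for illustration.

```python
import json

# Heterogeneous records land in the lake untouched – sales data next to
# social media content, with no upfront schema (hypothetical examples).
raw_events = [
    '{"type": "sale", "amount": 120, "region": "EU"}',
    '{"type": "tweet", "text": "love the new release!"}',
    '{"type": "sale", "amount": 80, "region": "US"}',
]

def read_sales(lines):
    """Apply a sales schema at read time, skipping records that don't fit."""
    for line in lines:
        record = json.loads(line)
        if record.get("type") == "sale":
            yield {"amount": record["amount"], "region": record["region"]}

total = sum(r["amount"] for r in read_sales(raw_events))
print(total)  # 200
```

A warehouse would have rejected or transformed the tweet at load time; the lake keeps it, and the cost of interpretation is deferred to whoever reads the data.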
Even so, that doesn’t mean you should just dump all your data in there in the hope that you might need it in the future. That simply adds to the problem of sorting through excessive data. All that data is only useful if it has not been compromised in some way. Likewise, because it has not been transformed, extracting it will be more time consuming than it would be had the data been stored in a data warehouse.
In many ways, you should think of your data lake as an area of loose storage behind a data warehouse – somewhere you put data and then forget about it. You don’t know exactly what’s in there or how valuable it will be, but you have an inkling it might come in handy at some point. The secret to making it work is controlling the chaos. Here’s how you do it.
Firstly, avoid dumping useless data. The flexibility of a data lake makes it tempting to pour in everything without a thought for whether you will ever need it. However, that simply adds to the difficulty of retrieving data when the time comes. To avoid this, implement a process to govern which data goes into the lake. Zones within the lake can maintain some degree of organization and make data a little easier to retrieve.

Everything should revolve around your final goals. You must know what you need the data for and what you hope it will achieve. This will help you define your data lake architecture. Many will simply use a big data platform such as Hadoop, but you need something which works effectively for your business.
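One lightweight way to picture such governance is a routing rule applied at ingest time, so files land in named zones rather than one undifferentiated heap. The zone names and the validation rule below are illustrative assumptions, not a standard.

```python
# Hedged sketch of zone-based ingest: each incoming file is routed to a
# "curated", "raw", or "quarantine" prefix by a simple governance rule,
# instead of being dumped unsorted into the lake.
def zone_for(filename: str, passed_validation: bool) -> str:
    """Return the lake prefix (zone) an incoming file should land in."""
    if not passed_validation:
        return "quarantine/" + filename   # possibly compromised data is isolated
    if filename.endswith((".csv", ".parquet")):
        return "curated/" + filename      # structured, query-ready formats
    return "raw/" + filename              # everything else kept as-is

print(zone_for("sales.csv", True))    # curated/sales.csv
print(zone_for("call.mp3", True))     # raw/call.mp3
print(zone_for("sales.csv", False))   # quarantine/sales.csv
```

Even a rule this simple preserves the lake’s flexibility while keeping retrieval manageable: nothing is rejected, but everything has a known address.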
AWS and Azure Data Lakes
The latest data lake platforms are evolving to match the requirements of businesses. In particular, they aim to make setup as easy as possible. This is the ethos behind the new AWS Lake Formation tool, which was launched at the recent re:Invent conference. “Everybody is excited about data lakes,” said AWS CEO Andy Jassy in his keynote. “People realize that there is significant value in moving all that disparate data that lives in your company in different silos and make it much easier by consolidating it in a data lake.”
Lake Formation automates many of the key functions of setting up a data lake. With conventional technologies, you would have had to configure storage on S3 buckets, move the data, add metadata, add it to the catalog, clean the data and set the right security policies. That’s a lot of work which can take several months – and in the fast-moving modern world, that might be several months too long. AWS’s approach is to do all of that for you, condensing those complications into a few clicks from a main dashboard and taking the formation of a data lake down from months to days, while still letting users enforce key security protocols and maintain access controls.
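For teams scripting rather than clicking, two of those steps – registering the S3 storage location and granting access – can also be driven through the AWS SDK. The sketch below uses boto3’s Lake Formation client; the bucket name and principal ARN are placeholders, and the AWS calls are kept inside a function so the sketch can be read (and its helper tested) without an AWS account.

```python
def s3_bucket_arn(bucket: str) -> str:
    """Build the S3 ARN that Lake Formation expects for a bucket."""
    return f"arn:aws:s3:::{bucket}"

def register_lake_location(bucket: str, principal_arn: str) -> None:
    """Register an S3 bucket with Lake Formation and grant one principal
    access to that storage location. Not executed here – requires AWS
    credentials and an existing bucket."""
    import boto3  # imported here so the sketch is readable without AWS installed

    lf = boto3.client("lakeformation")
    lf.register_resource(
        ResourceArn=s3_bucket_arn(bucket),
        UseServiceLinkedRole=True,  # let Lake Formation manage access to S3
    )
    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": principal_arn},
        Resource={"DataLocation": {"ResourceArn": s3_bucket_arn(bucket)}},
        Permissions=["DATA_LOCATION_ACCESS"],
    )

print(s3_bucket_arn("my-data-lake"))  # arn:aws:s3:::my-data-lake
```

This covers only two of the steps listed above; cataloging, cleaning and fine-grained table permissions follow the same pattern through the same client.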
At the same event AWS previewed AWS Control Tower, which helps users govern a multi-account environment or landing zone. It is designed for a new type of builder which AWS believes is emerging. While all builders are tinkerers by nature, this new type is less concerned with the detail – they want to build and implement more quickly. They are asking for, and getting, a host of new innovations which help them speed up their work and manage security. With Lake Formation, creating a secure data lake is as simple as defining where your data resides and what data access and security policies you want to apply.
Microsoft’s Azure also allows you to create data lakes. By storing a data lake in the cloud in this way, you gain a number of advantages, including greater performance, better flexibility and a central repository from which to manage all functions. Cloud data lakes make it possible for companies to transfer complex operations such as payments processing onto the cloud, enhancing the range of services they can deliver to their customers.
Data and Beyond
Increasingly, we are moving towards an environment of data as a service. Users want on-demand options for their data, including:
- Data virtualization: the ability to access data sets in multiple repositories
- Data acceleration: faster and more interactive access to larger data sets
- Data curation: tools to quickly blend multiple data sets for specific tasks
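The last item, curation, is the easiest to make concrete: it means joining data sets that live separately in the lake into one task-specific view. The sales records and region lookup below are hypothetical.

```python
# Hedged sketch of data curation: blending two illustrative data sets
# (sales events and a region lookup) into a single view for one task.
sales = [
    {"region_id": 1, "amount": 120},
    {"region_id": 2, "amount": 80},
]
regions = {1: "EU", 2: "US"}  # lookup table curated separately

# Blend: resolve each sale's region_id against the lookup table.
curated = [
    {"region": regions[s["region_id"]], "amount": s["amount"]}
    for s in sales
]
print(curated)  # [{'region': 'EU', 'amount': 120}, {'region': 'US', 'amount': 80}]
```

In practice the blending happens through dedicated tools over far larger data sets, but the operation – join, rename, serve on demand – is the same.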
This is where technology is heading. The industry understands the benefit of the data lake – a huge store of unstructured data with all sorts of potential uses – but it also understands the challenge: all that data can be difficult to retrieve. To make a data lake work, you have to find a way of sorting through the clutter, delving into the depths and retrieving those nuggets of usable data. Do that and you’ll find yourself with a significant edge over the competition.
Market Research Team, RapidValue Solutions