We’re living in a world awash with expanding amounts of data. Some of it has been generated by business intelligence workloads, and some of it is less structured content that’s produced during manufacturing processes, or by retail point-of-sale devices and an ever-growing number of mobile, intelligent devices. Then, of course, there is the Internet of Things, and its growing number of connected devices continuously streaming out increasing volumes of structured and unstructured data.
This huge wave of data is overwhelming many existing enterprise storage infrastructures, regardless of whether the intent is to store and process the data locally, in a cloud service provider’s data center, or in some combination of the two. “Data lakes” are designed to address this data storage challenge, making the data more useful and accessible while still allowing enterprises to meet their security, privacy and data governance needs.
What is a data lake?
Data lakes are still an evolving concept, and the industry hasn’t coalesced around a single, universally accepted definition. A consensus definition, drawn from several different sources, follows:
“A data lake is a storage mechanism designed to facilitate the colocation and use of many different types of data, including data that is defined using various schemata, structural frameworks, blobs and other files.”
The hope is that a data lake will make it possible for an enterprise to gain new business insights by accumulating large amounts of data, in the format chosen by each workload, and then making that data easy to process using big data analytics, cross-workload analysis, reporting, research, and even some forms of transactional workloads.
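To make the idea of keeping data "in the format chosen by each workload" more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a description of any vendor's product: the directory layout, file names, field names, and the functions `write_sample_lake`, `read_records` and `sales_with_temperature` are all hypothetical, and a local temporary directory stands in for real data lake storage.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_sample_lake(root: Path) -> None:
    """Store each workload's data in the format that workload chose."""
    (root / "pos").mkdir()
    (root / "sensors").mkdir()
    (root / "images").mkdir()
    # Point-of-sale export: structured CSV (hypothetical fields).
    with open(root / "pos" / "sales.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["store", "amount"])
        writer.writerow(["north", "19.99"])
        writer.writerow(["south", "5.00"])
    # Device telemetry: semi-structured JSON, one object per line.
    with open(root / "sensors" / "readings.jsonl", "w") as f:
        f.write(json.dumps({"device": "d1", "store": "north", "temp_c": 21.5}) + "\n")
        f.write(json.dumps({"device": "d2", "store": "south", "temp_c": 19.0}) + "\n")
    # Unstructured blob: kept as raw bytes.
    (root / "images" / "shelf.bin").write_bytes(b"\x00\x01\x02")


def read_records(path: Path) -> list[dict]:
    """Interpret a file at read time, based on its extension."""
    if path.suffix == ".csv":
        with open(path, newline="") as f:
            return list(csv.DictReader(f))
    if path.suffix == ".jsonl":
        with open(path) as f:
            return [json.loads(line) for line in f if line.strip()]
    # Anything else is treated as an opaque blob; only its size is reported.
    return [{"blob": path.name, "bytes": path.stat().st_size}]


def sales_with_temperature(root: Path) -> list[dict]:
    """Cross-workload analysis: join sales to sensor readings by store."""
    readings = read_records(root / "sensors" / "readings.jsonl")
    temps = {r["store"]: r["temp_c"] for r in readings}
    return [
        {"store": s["store"], "amount": float(s["amount"]), "temp_c": temps.get(s["store"])}
        for s in read_records(root / "pos" / "sales.csv")
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        write_sample_lake(lake)
        for row in sales_with_temperature(lake):
            print(row)
```

In this sketch nothing is converted when it is stored. Each file is interpreted only when an analysis reads it, which is what lets data from separate workloads be combined after the fact.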
New tools, new thoughts
The movement toward implementation of data lakes is at the intersection of several trends. One is a move by cloud service providers who are seeking to innovate and provide new storage products.
Another trend sees enterprises experiencing fundamental shifts in the sources of their data and how they are using it. The data is now coming from many types of end user-focused devices and systems, while still being generated and processed by traditional systems. Efforts are underway to combine all of this structured and unstructured data, regardless of its form or original intent, making it easier to join with other systems of record. That’s where data lakes come in.
In addition, older approaches based on monolithic application and database design simply can’t offer the speed to keep up with consumer expectations, but they’re still being used to support legacy workloads. A data lake is a new tool to help developers deal with the tsunami of data coming from everywhere and deliver the on-demand performance expected by all users.
Finally, there’s the cloud. The horizontal scalability of cloud computing has introduced new database architectures allowing enterprises to build massive data lakes at hyperscale while maintaining the necessary data consistency across distributed environments.
Concerns about diving into data lakes
Some industry research firms have published notes or given conference presentations warning that enterprises shouldn’t dive into a data lake without proper planning. Some steps to take include:
- Make sure providers are defining data lakes in a way that their tools and products really do serve your requirements.
- Consider the level of data analysis and data manipulation expertise within your organization, so that you can make the best use of a data lake.
- Ensure your corporate data governance, security and privacy policies match up with your data lake implementation.
- Test that the storage performance of the data lake meets the needs of all workloads.
A storage and interconnection solution for data storage demands
Data lakes may be an emerging enterprise tool, but the need for better ways to store and exploit burgeoning amounts of data is longstanding and only increasing in relevance. Equinix Data Hub offers a data storage and interconnection solution that enables the enterprise to move massive data stores, including data lakes, closer to where its data is created or needs to be accessed by users, analytics and clouds.
Data Hub is a localized storage repository that can be easily deployed in 40 markets worldwide, so companies can safely store their data close to users, analytics engines and clouds for faster access and accelerated processing and insights. Data Hub also enables robust disaster recovery strategies and makes it easy to comply with regulations worldwide requiring companies to house data within certain borders.
Read the Equinix Data Hub solution brief to learn how to create high-performance and secure data lakes for enterprise workloads and applications.
And learn ways to prevent data lakes from turning into data swamps.